Stochastic modeling of spatially distributed sequences

ABSTRACT

Apparatus for building a stochastic model of a spatially related data sequence, the data sequence comprising symbols selected from a finite symbol set, the apparatus comprising: an input for receiving said data sequence, a tree builder for expressing said symbols as a series of counters within nodes, each node having a counter for each symbol, each node having a position within said tree, said position expressing a symbol sequence and each counter indicating the number of occurrences of its corresponding symbol following the symbol sequence of its respective node, and a tree reducer for reducing said tree to an irreducible set of conditional probabilities of relationships between symbols in said input data sequence. The tree may then be used to carry out a comparison with a new data sequence to determine a statistical distance between the old and the new data sequences.

FIELD OF THE INVENTION

[0001] The present invention relates to stochastic modeling of sequences and more particularly but not exclusively to modeling of data sequences representing spatial distributions using stochastic techniques, and again but not exclusively for using the resultant models for analysis of the sequence.

BACKGROUND OF THE INVENTION

[0002] Data sequences often contain redundancy, context dependency and state dependency. Often the relationships within the data are complex, non-linear and unknown, and the application of existing control and processing algorithms to such data sequences does not generally lead to useful results.

[0003] Statistical Process Control (SPC) essentially began with the Shewhart chart and since then extensive research has been performed to adapt the chart to various industrial settings. Early SPC methods were based on two critical assumptions:

[0004] i) there exists a priori knowledge of the underlying distribution (often, observations are assumed to be normally distributed); and

[0005] ii) the observations are independent and identically distributed (i.i.d.).

[0006] In practice, the above assumptions are frequently violated in many industrial processes.

[0007] Current SPC methods can be categorized into groups using two different criteria as follows:

[0008] 1) methods for independent data where observations are not interrelated versus methods for dependent data;

[0009] 2) methods that are model-specific, requiring a priori assumptions on the process characteristics and its underlying distribution, and methods that are model-generic. The latter methods try to estimate the underlying model with minimum a priori assumptions.

[0010] FIG. 1 is a chart of relationships between different SPC methods and includes the following:

[0011] Information Theoretic Process Control (ITPC) is an independent-data-based and model-generic SPC method proposed by Alwan, Ebrahimi and Soofi (1998). It utilizes information theory principles, such as maximum entropy, subject to constraints derived from the dynamics of the process. It provides a theoretical justification for the traditional Gaussian assumption and suggests a unified control chart, as opposed to traditional SPC, which requires separate charts for each moment.

[0012] Traditional SPC methods, such as Shewhart, Cumulative Sum (CUSUM) and Exponential Weighted Moving Average (EWMA), are for independent data and are model-specific. It is important to note that these traditional SPC methods are extensively implemented in industry. The independence assumptions on which they rely are frequently violated in practice, especially since automated testing devices increase the sampling frequency and introduce autocorrelation into the data. Moreover, implementation of feedback control devices at the shop floor level tends to create structured dynamics in certain system variables. Applying traditional SPC to such interrelated processes increases the frequency of false alarms and shortens the 'in-control' average run length (ARL) in comparison to uncorrelated observations. As shown later in this section, these methods can be modified to control autocorrelated data.

[0013] The majority of model-specific methods for dependent data are time-series based. The underlying principle of such model-dependent methods is as follows: assuming a time-series model family can best capture the autocorrelation process, it is possible to use that model to filter the data, and then apply traditional SPC schemes to the stream of residuals. In particular, the ARIMA (Auto Regressive Integrated Moving Average) family of models is widely applied for the estimation and filtering of process autocorrelation. Under certain assumptions, the residuals of the ARIMA model are independent and approximately normally distributed, to which traditional SPC can be applied. Furthermore, it is commonly conceived that ARIMA models, mostly the simple ones such as AR(1), can effectively describe a wide variety of industrial processes.

[0014] Model-specific methods for autocorrelated data can be further partitioned into parameter-dependent methods that require explicit estimation of the model parameters, and parameter-free methods, where the model parameters are only implicitly derived, if at all.

[0015] Several parameter-dependent methods have been proposed over the years for autocorrelated data. Alwan and Roberts (1988) proposed the Special Cause Chart (SCC), in which the Shewhart method is applied to the stream of residuals. They showed that the SCC has major advantages over Shewhart with respect to mean shifts. The SCC deficiency lies in the need to explicitly estimate all the ARIMA parameters. Moreover, the method performs poorly for a large positive autocorrelation, since the mean shift tends to stabilize rather quickly to a steady-state value, and the shift is poorly manifested in the residuals (see Wardell, Moskowitz and Plante (1994) and Harris and Ross (1991)).
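By way of a non-limiting illustration, the following Python sketch applies the SCC idea: an AR(1) model is fitted to the autocorrelated observations and Shewhart-style limits are applied to the stream of residuals. The AR(1) choice, the least-squares estimator and the 3-sigma rule are illustrative assumptions and not the exact procedure of the cited references.

```python
# Sketch: Special Cause Chart idea - fit AR(1), chart the residuals.
import numpy as np

def scc_alarms(x, k=3.0):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Least-squares estimate of the AR(1) coefficient phi.
    phi = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
    residuals = x[1:] - phi * x[:-1]
    sigma = residuals.std(ddof=1)
    # Shewhart rule on the residual stream: flag points outside +/- k*sigma.
    return np.where(np.abs(residuals) > k * sigma)[0] + 1

rng = np.random.default_rng(0)
data = [0.0]
for _ in range(499):
    data.append(0.7 * data[-1] + rng.normal())
data = np.array(data)
data[300:] += 3.0  # an artificial mean shift
print(scc_alarms(data))  # indices of out-of-control observations
```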

[0016] Runger, Willemain and Prabhu (1995) implemented traditional SPC for autocorrelated data using CUSUM methods. Lu and Reynolds (1997, 1999) extended the method by using the EWMA method with a small difference: their model had a random error added to the ARIMA model. The drawback of these models is the need for explicit parameter estimation and estimation of their process-dependence features. It was demonstrated in Runger and Willemain (1995) that for certain autocorrelated processes, the use of traditional SPC yields an improved performance in comparison to ARIMA-based methods.

[0017] The Generalized Likelihood Ratio Test (GLRT) method proposed by Apley and Shi (1999) takes advantage of the transient dynamics of the residuals in the ARIMA model when a mean shift is introduced. The generalized likelihood ratio may be applied to the filtered residuals. The method may be compared to the Shewhart, CUSUM and EWMA methods for autocorrelated data, inferring that the choice of the adequate time-series based SPC method depends strongly on the characteristics of the specific process being controlled. Moreover, in Apley and Shi (1999) and in Runger and Willemain (1995) it is emphasized in conclusion that modeling errors of ARIMA parameters have strong impacts on the performance (e.g., the ARL) of parameter-dependent SPC methods for autocorrelated data. If the process can be accurately defined by an ARIMA time series, the parameter-dependent SPC methods are superior in comparison to non-parametric methods since they allow efficient statistical analysis. If such a definition is not possible, then the effort of estimating the time-series parameters becomes impractical. Such a conclusion, amongst other reasons, triggered the development of parameter-free methods to avoid the impractical estimation of time-series parameters.

[0018] A parameter-free model was proposed by Montgomery and Mastrangelo (1991) as an approximation procedure based on EWMA. They suggested using the EWMA statistic as a one-step-ahead prediction value for the IMA(1,1) model. Their underlying assumption was that even if the process is better described by another member of the ARIMA family, the IMA(1,1) model is a good enough approximation. Zhang (1998), however, compared several SPC methods and showed that Montgomery's approximation performed poorly. He proposed employing the EWMA statistic for stationary processes, but adjusted the process variance according to the autocorrelation effects.

[0019] Runger and Willemain (1995, 1996) discussed the weighted batch mean (WBM) and the unified batch mean (UBM) methods. The WBM method assigns weights to the observations mean and defines the batch size so that the autocorrelation among batches reduces to zero. In the UBM method the batch size is defined (with unified weights) so that the autocorrelation remains under a certain level.

[0020] Runger and Willemain demonstrated that weights estimated from the ARIMA model do not guarantee a performance improvement and that it is beneficial to apply the simpler UBM method. In general, parameter-free methods do not require explicit ARIMA modeling; however, they are all based on the implicit assumption that the time-series model is adequate to describe the process. While this can be true in some industrial environments, such an approach cannot capture more complex and non-linear process dynamics that depend on the state in which the system operates, for example processes that are described by Hidden Markov Models (HMM) (see Elliott, Aggoun and Moore (1995)).

[0021] The Problem of Pattern Classification

[0022] In general, the goal of pattern recognition, according to Therrien (1989), is to classify objects of interest into one of a number of categories or classes. The objects of interest are called patterns, and they may be printed letters or characters, biological cells, electronic wave-forms or signals, 'states' of a system or any number of other things that one may desire to classify. If there exists some set of labeled patterns, namely patterns whose classes are known, then one has a problem in supervised pattern recognition. The basic procedure followed in the design of a supervised pattern recognition system involves a portion of a set of labeled patterns being extracted and used to derive a classification algorithm. These patterns are called the training set. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. Since the correct classes of the individual patterns in the test set are also known, the performance of the algorithm can be evaluated. In supervised pattern recognition problems, the results are preferably evaluated by a 'teacher' or 'supervisor' whose output dictates suitable modifications to the algorithm; hence the term supervised pattern recognition. Once a desired level of performance is achieved (which is measured in terms of a misclassification rate), the algorithm can be used on initially unlabeled patterns. At this point, the feedback loop involving the teacher is formally broken. Nonetheless it is usually advisable to have some spot-checking of results. Such checks can be accommodated either by providing an alternative classification algorithm or a human observer if possible. In some situations it may be feasible to wait a certain length of time until the correct classification is known. If the classes of all of the available patterns are unknown, and perhaps even the number of these classes is unknown, then one has a problem in unsupervised pattern recognition or clustering. In clustering problems, one attempts to find classes of patterns with similar properties, where sometimes even these properties may be undefined. The unsupervised pattern recognition or clustering problem is a much more difficult one than the supervised pattern recognition problem. Nevertheless, useful algorithms have been developed in this area and success depends to a large extent on the ability to learn the structure of pattern measurement data in high-dimensional spaces. The present disclosure focuses on a supervised pattern recognition scheme.

[0023] The Pattern Recognition Approach:

[0024] In the typical pattern recognition approach, observations first undergo feature transformation and then classification in order to arrive at an output decision. An observation vector x is first transformed, by the feature transformation, into another vector y whose components are called features. The features are intended to be fewer in number than the observations but should collectively contain most of the information needed for classification of the patterns. By reducing the observations to a smaller number of features, one hopes to design a decision rule that is more reliable. The feature vector y can be represented in a feature space Y similar to the way that observation vectors are represented in the observation space. The dimension of the feature space, however, is usually much lower than the dimension of the observation space. Procedures that analyze data in an attempt to define appropriate features are called feature extraction procedures. The feature vector y is passed to a classifier whose purpose is to make a decision about the pattern. The classifier essentially induces a partitioning of the feature space into a number of disjoint regions. If the feature vector corresponding to a pattern falls into region Ri, the pattern is assigned to class Wi.

[0025] In general, the symbol x is used herein to represent observation vectors and y is used to represent feature vectors.

[0026] There are several ways to perform pattern recognition. We classify the pattern recognition methods into different classes, as shown in the tree depicted in attached FIG. A1, which is in compliance with Duda et al (2001). We will detail those branches in the tree that are related to the present disclosure.

[0027] The first classification is between supervised pattern recognition vs. unsupervised pattern recognition:

[0028] In supervised pattern recognition, the types and the number of the existing classes are known. In addition, the classes in the training set are tagged.

[0029] By contrast, in unsupervised pattern recognition, the classes of all of the available patterns are unknown, and in some cases even the number of these classes is unknown. Consequently, in such situations the classes in the training set are not tagged and the problem becomes a clustering problem.

[0030] The present disclosure concerns problems of supervised pattern recognition, since, as will be explained below in the description of the specific embodiments, the construction algorithm may make use of the different tagged classes in the training set to generate a different context-tree model for each class. For example, in the promoter recognition problem there are two tagged classes: 'promoters' and 'non-promoters'. We thus continue to detail the supervised pattern recognition branch.

[0031] The second classification distinguishes between statistical and logical methods.

[0032] Logical methods are usually used when the classification problem involves nominal data, for instance descriptions, that are discrete and without any natural notion of similarity or even ordering (Duda et al (2001)). The decision tree is an example of a logical method. This branch is irrelevant to the present disclosure.

[0033] Statistical methods use statistical tools and they are based on feature vectors of real-valued and discrete-valued numbers. There can be a natural measure of distance between these vectors. In this category, which is relevant to the present disclosure, we make another distinction between unknown probabilistic models and known probabilistic models.

[0034] In unknown probabilistic models, the underlying probabilistic model is unknown. In many cases researchers make use of discriminant functions to address these types of problems. Since we assume that a general context-tree model can well represent the different classes (although the parameters of the tree are unknown and need to be estimated from the training set), we do not consider this branch of methods.

[0035] Known probabilistic models: the distribution function or a general probabilistic model, such as a transition probability tree, is assumed known. We assume that a general context-tree model can well represent the different classes. In this category, which is relevant to the present disclosure, we distinguish between the following two types of models:

[0036] Known parameters: models based on known parameters. This is often the easiest, albeit the rarer, problem. In this case, researchers typically use Bayesian decision theory to classify the unknown object.

[0037] Unknown parameters: in these cases, researchers often estimate the parameters by known methods such as maximum likelihood estimation (where parameters are assumed to be fixed), Bayesian estimation (where parameters are assumed to be random variables), and Gibbs sampling. The present disclosure belongs to this branch of methods. This branch includes some other state-dependent models such as Markov models, Hidden Markov Models, neural nets, etc. Note that once the parameters of the model are estimated, conventional methods of classification can be used, such as those based on Bayesian decision theory.

[0038] Given the above classification, note that Markov models are the methods closest to the disclosure presented here. In the following, we briefly sketch the Markov models.

[0039] Markov Models

[0040] Markov models are based on a finite memory assumption, i.e., that each symbol depends only on its k predecessors, where k is fixed. The simplest model is the first-order Markov model, which assumes that each symbol at time t depends only on the symbol at time t−1: P(x_t = w(t) | x_1 = w(1), x_2 = w(2), . . . , x_(t−1) = w(t−1)) = P(x_t = w(t) | x_(t−1) = w(t−1)), where the state at time t is denoted by w(t).

[0041] In order to calculate the probability that the model generates a particular sequence, the successive probabilities should simply be multiplied.
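By way of a non-limiting illustration, the following Python sketch scores a DNA sequence under a first-order Markov model by multiplying the successive transition probabilities (in log space, to avoid numerical underflow). The initial and transition probabilities below are toy values assumed for the example.

```python
# Sketch: probability of a sequence under a first-order Markov model.
import math
from itertools import product

alphabet = 'ACGT'
initial = {s: 0.25 for s in alphabet}
# Uniform transitions, with an illustrative bias after the symbol A.
transition = {(a, b): 0.25 for a, b in product(alphabet, repeat=2)}
transition[('A', 'A')] = 0.40
for b in 'CGT':
    transition[('A', b)] = 0.20

def log_prob(seq):
    # Multiply successive probabilities; summing logs is equivalent.
    lp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(transition[(prev, cur)])
    return lp

print(log_prob('AACGT'))
```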

[0042] Markov models of higher order simply extend the size of the memory. The suggested methods of the present disclosure can be viewed as a varying-order Markov model, since the order of the memory does not have to be fixed, as explained later.

[0043] In general, Markov models assume that the states are accessible. In many cases, however, the perceiver does not have access to the states. Consequently, the Markov model should be augmented to a Hidden Markov Model, which is a Markov model with invisible states. Hidden Markov models have a number of parameters whose values are set so as to best explain training patterns for the known category.

[0044] An alternative model to the Markovian is the context-tree, which was suggested by Rissanen (1983) for data compression purposes and modified later in Weinberger, Rissanen and Feder (1995) and in Ben-Gal et al (2000, 2001). The tree presentation of a finite-memory source is advantageous since states are defined as contexts, graphically represented by branches of variable length in the context-tree, and hence requires less estimation effort than a Markov presentation. The context-tree is an irreducible set of conditional probabilities of output symbols given their contexts. The tree is conveniently estimated by the context algorithm. The algorithm generates an asymptotically minimal tree fitting the data. The attributes of the context-tree, along with the ease of its estimation, make it suitable for a model-generic classifier, as explained later.
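By way of a non-limiting illustration, the counting stage that underlies such a tree may be sketched in Python as follows: a single pass over the sequence updates, for every context up to a chosen maximum depth, a counter of the symbols observed to follow that context. The pruning stage of the context algorithm, which compares the distributions of a node and its parent, is omitted from this sketch.

```python
# Sketch: per-context symbol counters, the raw material of a context tree.
from collections import defaultdict

def build_counts(seq, max_depth=3):
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    for i, sym in enumerate(seq):
        for d in range(max_depth + 1):
            if d > i:
                break
            context = seq[i - d:i]  # the d symbols preceding position i
            counts[context][sym] += 1
    return counts

counts = build_counts('ABRACADABRA', max_depth=2)
print(dict(counts['A']))  # counts of symbols that follow the context 'A'
```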

[0045] Pattern Classification in Biology

[0046] Applications of Pattern Recognition in Biology

[0047] The following is a summary of Prof. Shamir's lecture notes available at http://www.math.tau.ac.il/˜rshamir/ and from Higgins and Taylor (2000). The sequences of family members in biology are often compared in order to find properties that are shared by all members and to understand how these could explain certain biological properties. Discovering patterns in biology is widespread, and has two main applications: i) classification, where the patterns are used for discriminating between family members and non-members; and ii) finding patterns that describe biologically important features.

[0048] In the following we briefly describe some of the main applications of pattern recognition in biology. Note that the present embodiments are applicable to all these problems.

Coding Sequences in Prokaryotic Gene Structure

[0049] There are more than 3 billion bases of human, albeit eukaryotic, DNA sequences and complete DNA sequences for dozens of species available in GenBank. Not all the sequences are coding, namely serve as a template for a protein. In the human genome only 3%-5% of the sequences are coding sequences, and the approximate proportion applies also to prokaryotic sequences. Due to the size of the database, there is a need for a way of automatically finding coding sequences. The algorithm should look for long sequences of codons, without any stop codon. It should scan the DNA sequence, looking for long ORFs (open reading frames) in all three reading frames (the first codon can start from the first, second or third base). After detecting a stop codon, the algorithm scans backward, searching for a start codon.

[0050] It should be noticed that the coding sequence has no fixed length, a fact that creates difficulties for the pattern recognition algorithm. A typical prokaryote sequence is shown in FIG. A2.

[0051] Exons in Eukaryotic Sequences

[0052] The gene structure and the gene expression mechanism in eukaryotes are far more complicated than in prokaryotes. In typical eukaryotes, the region of the DNA coding for a protein is usually not continuous. This region is composed of alternating stretches of exons and introns. During transcription, both exons and introns are transcribed onto the RNA, in their linear order. Thereafter, a process called splicing takes place, in which the intron sequences are excised and discarded from the RNA sequence. The remaining RNA segments, the ones corresponding to the exons, are ligated to form the mature RNA strand. A typical multi-exon gene has the following structure: It starts with the promoter region, which is followed by a transcribed but non-coding region called the 5′ untranslated region (5′ UTR). Then follows the initial exon, which contains the start codon. Following the initial exon, there is an alternating series of introns and internal exons, followed by the terminating exon, which contains the stop codon. It is followed by another non-coding region called the 3′ UTR. Ending the eukaryotic gene, there is a polyadenylation (polyA) signal: the nucleotide Adenine repeating several times. The exon-intron boundaries (i.e., the splice sites) are signaled by specific short (2 bp long) sequences. The 5′ (3′) end of an intron (exon) is called the donor site, and the 3′ (5′) end of an intron (exon) is called the acceptor site. FIG. A3 represents a typical eukaryote sequence structure.

[0053] Exon length does not have a geometric distribution. The length seems to have a functional role in the splicing itself. Typically, exons that are too short (under 50 bp) leave no room for the spliceosomes (enzymes that perform the splicing) to operate, and exons that are too long (above 300 bp) are difficult to locate. Thus another model for exon length is required.

[0054] An algorithm for this kind of recognition should notice the differences between the exons and the introns, and the connections between them. It should also take into consideration the different length distributions: an average internal exon is about 150 bp long, while introns of the order of 1 Kbp length are not uncommon.

[0055] Promoters

[0056] A promoter is a region of DNA to which RNA polymerase binds before initiating the transcription of DNA into RNA.

[0057] Not all open reading frames are transcribed into genes. The transcription depends on regulatory regions that control the transcription rate. In the transcription process, an RNA polymerase binds tightly to the promoter. The promoter is an 'anchor' point; it pinpoints where RNA transcription should begin. At the stop signal the polymerase releases the RNA and detaches itself from the DNA. We further distinguish between two cases:

[0058] i) Prokaryotic Promoters:

[0059] The promoter contains two remote sequences of six bases. In one example of our invention, which is described later, we identify E. coli promoters. In E. coli one can find the following consensus sequence around the RNA transcription start point: nnnTTGACAnnnnnnnnnnnnnnnnnnTATAATnnnnnnNnnn. N is the transcription start point. TTGACA appears 35 bases before N, and TATAAT (also known as the TATA box or Pribnow box) appears 12 bases before N. We have here 2 anchor points for the polymerase. These sequences are short but the frequency of their occurrence is high.
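By way of a non-limiting illustration, the consensus above can be searched for with a simple regular expression, as in the following Python sketch. The 15-21 base spacer range is an assumption made for the example; as noted next, exact consensus matching misses most real promoters, which motivates probabilistic models.

```python
# Sketch: naive search for the -35 box (TTGACA) and -10 box (TATAAT).
import re

promoter_re = re.compile(r'TTGACA[ACGT]{15,21}TATAAT')

seq = 'GCGC' + 'TTGACA' + 'A' * 17 + 'TATAAT' + 'GCATTACG'
m = promoter_re.search(seq)
if m:
    print('candidate promoter at', m.start(), m.group())
```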

[0060] Since the consensus sequence mentioned above does not appear in each promoter, an algorithm should be developed in order to recognize a pattern that will suit all the promoters.

[0061] ii) Eukaryotic Promoters:

[0062] Much less is known about eukaryote promoters; each of the three RNA polymerases has a different promoter.

[0063] RNA polymerase I recognizes a single promoter for the precursor of rRNA.

[0064] RNA polymerase II, which transcribes all genes coding for polypeptides, recognizes many thousands of promoters. Most have the Goldberg-Hogness or TATA box that is centred around position −25 and has the consensus sequence 5′-TATAAAA-3′. Several promoters have a CAAT box around −90 with the consensus sequence 5′-GGCCAATCT-3′. There is increasing evidence that all promoters for genes for 'housekeeping' proteins contain multiple copies of a GC-rich element that includes the sequence 5′-GGGCGG-3′. Transcription by polymerase II is also affected by more distant elements known as enhancers.

[0065] The promoter for RNA polymerase III is located within the gene either as a single sequence, as in the 5s RNA gene, or as two blocks, as in all tRNA genes.

[0066] Splice Sites

[0067] The splice site structures are: 5′ splice sites: MAG|GTRAGT, where M is A or C and R is A or G; 3′ splice sites: CAG|GT.

[0068] An algorithm might predict the splice sites that distinguish between the exons and the introns.
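By way of a non-limiting illustration, the degenerate 5′ splice-site pattern above can be matched by expanding the IUPAC codes (M is A or C, R is A or G) into a regular expression, as in the following Python sketch; the helper names are illustrative.

```python
# Sketch: matching the degenerate donor-site pattern MAGGTRAGT.
import re

IUPAC = {'M': '[AC]', 'R': '[AG]', 'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T'}

def iupac_to_regex(pattern):
    return re.compile(''.join(IUPAC[c] for c in pattern))

donor = iupac_to_regex('MAGGTRAGT')  # exon|intron boundary mark dropped
for m in donor.finditer('TTCAGGTAAGTGCCAGGTGAGT'):
    print('donor site at', m.start(), m.group())
```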

[0069] Terminators

[0070] At the end of the coding sequences a signal exists (the terminator) which means 'stop making RNA here'. It is composed of a sequence which is rich in the bases G and C and which can form a hairpin loop. This structure is more strongly hydrogen bonded (G-C base pairs are held together by three hydrogen bonds), causing the RNA polymerase to slow down.

[0071] Poly A

[0072] Polyadenylic acid is a sequence of varying length found at the 3′ end of most eukaryotic mRNAs. The poly-A tail is added post-transcriptionally to the primary transcript as part of the nuclear processing of RNA, yielding hnRNAs with 60-200 adenylate residues in the tail. In the cytoplasm the poly-A tail on mRNAs is gradually reduced in length. The function of the poly-A tail is not clear.

[0073] It is useful to find the pattern of the polyadenylic acid.

[0074] Proteins

[0075] Proteins are long chains of Amino Acids (AA). There are 20 types of AA that serve as building blocks for proteins. Each AA has a specific chemical structure. The length of a protein chain can range from 50 to 3000 AA (200 on average). One of the interesting properties of proteins is their unique folding. The AA composition of a protein will usually uniquely determine (under specific environment conditions) the 3D structure of the protein (e.g., two proteins with the same AA sequence will have the same 3D structure in natural conditions). Research on the 3D structure of proteins has shown that when a folded protein is artificially stretched to a chain, it folds back to its original 3D structure. Proteins are known to have many important functions in the cell, such as enzymatic activity, storage and transport of material, signal transduction, antibodies and more. All proteins whose structure is known are stored in the Protein DataBank (PDB), which contains more than 10,000 proteins.

[0076] A protein has multiple levels of structure:

[0077] Primary structure—Chain of Amino Acids (1 dimensional).

[0078] Secondary structure—Chains of structural elements, the most important of which are α-helices and β-sheets.

[0079] Tertiary and Quaternary structure—3D structure, of a single AA chain or several chains, respectively.

[0080] The pattern recognition of proteins is done at these three levels: primary structure (20 symbols), secondary structure and 3D structure.

Classification Concepts in Biology

[0081] Pattern-Based Classification Approaches

[0082] Brazma et al (1998) proposed a general three-step approach for discovering patterns from protein and DNA sequences:

[0083] i) Choose a solution space (a set of patterns that the method should discover).

[0084] ii) Define a fitness function reflecting how well a pattern fits the input sequence.

[0085] iii) Develop an algorithm, which gets a set of input sequences and returns the patterns with the highest fitness.

[0086] According to Jonassen (2000), there are two main groups of sequence patterns: deterministic patterns (regular expression type patterns) and probabilistic patterns. Next, both types of patterns are described.

[0087] Deterministic Patterns

[0088] In deterministic pattern methods, a sequence either matches or does not match the pattern.

[0089] A very simple type of pattern is a substring pattern: a sequence matches a substring pattern if it contains the substring.

[0090] When matching a sequence against a substring pattern there are two kinds of matching: exact and approximate.

[0091] When using approximate matching, a sequence matches the pattern if it contains a substring approximately equal to the pattern. 'Approximately equal' means that only some of the characters are the same.

[0092] In the approximate matching, a distance is defined between a pattern and a substring, and an upper limit on the distance (a threshold) is set.

[0093] Typical ways to measure the distance between two strings are:

[0094] i) Hamming distance—counting the number of character changes needed to transform one string into the other (the number of mismatches). The measure is relevant only for two strings of the same size.

[0095] ii) Gap-based methods. In these methods, in order to match the two strings, it is allowed to insert or delete characters in addition to substituting them (namely, gaps are allowed). For example, ACCDDECA versus ACDDECA:

Without gaps (3 matches):
ACCDDECA
|| |
ACDDECA

With gaps (7 matches, one gap of length 1):
ACCDDECA
| ||||||
A_CDDECA

[0096] In the above methods, the distance is a function of the number of mismatches, the number of gaps and their lengths.
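By way of a non-limiting illustration, the Hamming-distance variant of approximate matching can be sketched in Python as follows: a window of the sequence matches the pattern if the number of mismatches does not exceed the chosen threshold. Gapped matching would require dynamic programming and is not shown.

```python
# Sketch: approximate substring matching under the Hamming distance.
def hamming(a, b):
    assert len(a) == len(b), 'Hamming distance needs equal-length strings'
    return sum(x != y for x, y in zip(a, b))

def approx_matches(sequence, pattern, max_mismatches=1):
    k = len(pattern)
    return [i for i in range(len(sequence) - k + 1)
            if hamming(sequence[i:i + k], pattern) <= max_mismatches]

print(approx_matches('ACCDDECA', 'CDDE'))     # exact occurrence at index 2
print(approx_matches('ACCDDECA', 'CDEE', 1))  # approximate occurrence
```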

[0097] Alternatively, there are methods that define a score between a pattern and a substring and set a lower limit on the score for a match to be allowed.

[0098] The advantages of the deterministic patterns are that they are very simple, mathematically pure and easy to interpret. Their disadvantage is that often there is more than one pattern that matches all the family members.

[0099] Probabilistic Patterns

[0100] In these patterns, for each position in the pattern, a score is assigned to each of the symbols (e.g., the four bases in DNA, or the twenty amino acids in proteins). Additionally, penalties (or probabilities) for insertions or deletions at each pattern position are assigned. These can also be seen as generalizations of substring patterns. These methods assign a score (probability) to a match to a sequence.

[0101] This category contains: Profiles (Position Specific Scoring Matrices), HMM (Hidden Markov Models) and Probabilistic Trees. All these methods are related to the suggested algorithm, as further detailed below.

[0102] In the probabilistic patterns the probabilities are often calculated from:

[0103] i) the distribution of the symbols (e.g., amino acids or bases) in the columns of the probability matrices.

[0104] ii) some external information from substitution matrices.

[0105] iii) Dirichlet mixtures.

[0106] The invention presented here derives the probability measures of symbols based on their type, position and context. However, the set of probabilities to be used for classification is determined by the algorithm in a different manner than the above-mentioned methods.

[0107] The advantages of probabilistic patterns are that they can be used when it is not possible to define one single regular expression type pattern which matches all family sequences. They can assign different scores and different gap penalties to each symbol in different pattern positions. Their disadvantages are that they contain many more parameters to be estimated, and in order to estimate their values, a large number of family members is needed. In addition, noisy examples can enter the learning process, and therefore unmatched sequences will be recognized as matched (see Higgins and Taylor, 2000, the contents of which are hereby incorporated by reference).

[0108] Pattern and Sequence Driven Algorithms

[0109] Brazma et al (1998) identified two main algorithmic approaches: Pattern Driven (PD) methods and Sequence Driven (SD) methods.

[0110] Pattern driven methods: In the simplest form, PD methods enumerate all patterns in the solution space, calculating each pattern's fitness so that the best ones can be output. There are more sophisticated methods that contain mechanisms for avoiding looking at all patterns in the solution space. Sequence driven methods: In these methods different pairs of sequences are compared in order to find similarities as patterns. These methods are very similar to the sequence alignment methods. Examples of sequence driven methods are:

[0111] i) Smith and Smith (1990) developed a method that first computes the similarity between all pairs of input sequences. The most similar pair is input to a local pair-wise alignment method that is based on dynamic programming. This algorithm outputs a pattern common to the two sequences. The two sequences aligned are replaced by their common pattern and the procedure is repeated until there remains only one pattern matching all the input sequences.

[0112] ii) The Pratt program gets as input a set of sequences and finds patterns matching at least a minimum number of the sequences, as defined by the user. The user can also input constraints on the patterns (the solution space). Pratt uses a two-step search (initial pattern search and pattern refinement) for finding conserved patterns (patterns that match at least the minimum number of the sequences) from the chosen solution space having maximum fitness.

[0113] The invention presented here is related to sequence driven methods, since patterns are not enumerated. Instead, the training sequences are used to construct the tree model, which represents probabilistically a set of patterns.

[0114] Fitness Functions

[0115] Fitness functions usually reflect how well a pattern fits the input sequence. Some fitness functions are designed according to the following principles:

[0116] i) According to Jonassen (2000), the information content of a pattern is a measure of the information gained about an unknown sequence when one is told that the sequence matches the pattern.

[0117] ii) Brazma et al (1996) described the minimum description length (MDL) principle, which assigns a score to a pattern that depends on the pattern's information content and on how many sequences it matches. The user can select parameters in order to slant the optimum towards strong patterns matching few sequences or towards weaker patterns matching many sequences.

[0118] iii) PPV (Positive Predictive Value)—usually used by the Pratt program when the aim is to find patterns to be used for classification.

[0119] The invention presented here makes use of the tree model to derive a fitness function for each sequence.

[0120] Training the Model

[0121] In methods of supervised pattern recognition (and particularly in those related to biological pattern recognition), a portion of the set of labeled patterns is extracted and used to derive a classification algorithm. These patterns comprise the training set.

[0122] The remaining patterns are then used to test the classification algorithm; these patterns are referred to as the test set. Since the correct classes of the individual patterns in the test set are also known, the performance of the algorithm can be evaluated. There is a tradeoff between the classification of the training samples and the performance of the algorithm on new objects. Perfect classification of the training samples can be achieved, but in this case new objects that were not part of the training set usually are not well recognized. This situation is known as overfitting, and it should be avoided. Thus, it is very important to determine how to adjust the complexity of the model: it should not be so simple that it cannot explain the differences between the categories, yet not so complex as to give poor classification on novel patterns. There are several ways to evaluate the algorithm. Among them are:

[0123] iv) Parametric Models: The generalization error rate is computed from the assumed parametric model. A test set is not needed in these models.

[0124] v) Simple Cross-Validation: The set of labeled samples D is split into two parts, as mentioned above. One part is used as the traditional training set. The other part is the validation set, which is used to estimate the error rate. The classifier is trained until one reaches a minimum of this error.

[0125] vi) General Cross-Validation: In general cross-validation methods, the performance is based on multiple, independently formed validation sets. For example, in m-fold cross-validation, the training set is randomly divided into m disjoint sets of equal size. The classifier is trained m times, each time with a different set held out as a validation set. The estimated performance is the mean of these m errors. In anti-cross-validation the adjustment of parameters is halted when the validation error reaches its first local maximum. A minimal sketch of the m-fold procedure is given below.
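By way of a non-limiting illustration, the m-fold procedure may be sketched in Python as follows; `train` and `error_rate` are hypothetical stand-ins for a concrete classifier and its evaluation.

```python
# Sketch: m-fold cross-validation over a labeled training set.
import random

def m_fold_cv(samples, labels, m, train, error_rate):
    indices = list(range(len(samples)))
    random.shuffle(indices)
    folds = [indices[i::m] for i in range(m)]  # m disjoint index sets
    errors = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        errors.append(error_rate(model,
                                 [samples[i] for i in held_out],
                                 [labels[i] for i in held_out]))
    return sum(errors) / m  # mean of the m validation errors
```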

[0126] Conventional Performance Measure (taken from Jonassen, 2000)

[0127] When a pattern is to be used for classification, it should ideally match all family members and no other sequences. Most often, however, the pattern fails to match some member sequences (called false negatives), and it may match some sequences outside the family (called false positives). The fewer false negatives, the more sensitive the pattern is said to be, and the fewer false positives, the more specific it is. Ideally, a pattern should have zero false positives and negatives.

[0128] An estimate of the number of matches in a sequence database can be found by multiplying the probability that one random sequence matches the pattern by the number of sequences in the database. In order to calculate the probability, it is often assumed that random sequences are generated using a specific probabilistic model. Sternberg (1991) considered all the patterns in the PROSITE database and obtained a clear correlation between the expected number of false positives and the actual number, i.e., the number of unrelated sequences in the SWISS-PROT database matching the pattern.

[0129] Denoting the number of true positives (sequences in the family matching the pattern) by TP and the number of false negatives by FN, the sensitivity of a pattern can be defined as,

Sensitivity=TP/(TP+FN).

[0130] The sensitivity measures how big a proportion of the family sequences is 'picked up by' (matched by) the pattern. Similarly, the specificity of the pattern can be defined as,

Specificity=TN/(TN+FP),

[0131] (where TN and FP are, respectively, the number of true negatives and the number of false positives) which measures how big a proportion of the sequences outside the family is not matched by the pattern. Yet another useful measure is the positive predictive value (PPV), which determines how big a proportion of the sequences matching the pattern is actually in the family, i.e.,

PPV=TP/(TP+FP).

[0132] The value range for all the above measures is from zero to one, one being the best possible. When evaluating patterns to be used for classification, one needs to use more than one of the measures. This can be illustrated by two degenerate cases: (1) the empty pattern matching any sequence, and (2) a pattern matching one single member of the family. Pattern (1) has perfect sensitivity, but very bad specificity and PPV, while pattern (2) has perfect specificity and PPV, but bad sensitivity. In practice, one often needs to make a trade-off between sensitivity and specificity when choosing which pattern to use for a family. One way to evaluate a probabilistic pattern's ability to discriminate between family members and other sequences is to find a cut-off on the score that gives the same number of false positives and false negatives. Tatusov et al. (1994) evaluated alternative ways of finding weight matrices from local ungapped alignments using this approach. Another approach is to achieve maximum TP given very high TN. This approach relies on the assumption that the proportion of the 'negatives' in the population is very high.
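By way of a non-limiting illustration, the three measures defined above can be computed from raw classification outcomes as in the following Python sketch (1 denotes a family member, 0 a non-member).

```python
# Sketch: sensitivity, specificity and PPV from labeled outcomes.
def performance(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {'sensitivity': tp / (tp + fn),
            'specificity': tn / (tn + fp),
            'ppv': tp / (tp + fp)}

print(performance([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```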

[0133] Models Used for Pattern Classification in Biology

[0134] Some general probabilistic methods are often used to recognize a number of families in biology. Some of the most common methods sharing the same idea are Profiles (Position Specific Scoring), PWM (Positional Weight Matrix) and WMM (Weight Matrix Model). These methods are often referred to as non-homogeneous models, meaning that the distribution of the symbols differs from one position in the pattern to another.

[0135] Gribskov et al (1987) initially suggested the profile matrices. A profile is a scoring matrix giving a position-specific score and specific gap penalties for each symbol (amino acids or bases). The matrix, known as the PWM, is a table of statistics, f_(b,i), of the frequencies of the symbol b in position i of the known sequences (e.g., promoter or coding). This model assumes that positions are independent. GENSCAN uses different signal models to model different functional units. One of the models is the WMM, in which every position has its own specific independent distribution. It is used for modeling polyadenylation signals, the translation initiation signal, the translation termination signal and promoters.
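By way of a non-limiting illustration, such a frequency table f_(b,i) can be estimated from aligned training sequences as in the following Python sketch; the pseudocount, which avoids zero frequencies, is an assumption made for the example.

```python
# Sketch: building a positional weight matrix (PWM) of symbol frequencies.
def build_pwm(sequences, alphabet='ACGT', pseudocount=1.0):
    length = len(sequences[0])
    pwm = {b: [pseudocount] * length for b in alphabet}
    for seq in sequences:
        for i, b in enumerate(seq):
            pwm[b][i] += 1
    total = len(sequences) + pseudocount * len(alphabet)
    return {b: [c / total for c in counts] for b, counts in pwm.items()}

pwm = build_pwm(['TATAAT', 'TATGAT', 'TACAAT'])
print(round(pwm['A'][1], 2))  # estimated frequency of A at position 2
```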

[0136] Another model is the weighted array model (WAM). The WAM model is a generalization of the WMM model that allows dependencies between adjacent positions. The WAM model is used for the recognition of the splice sites. Correct recognition of these sites greatly enhances the ability to predict correct exon boundaries. This modeling of splice sites gave GENSCAN a substantial improvement in performance. This model can be seen as an extension of the HMM, since each position has its own HMM network.

[0137] As was mentioned before, a hidden Markov model (HMM) is a Markov chain in which the states are not directly observable. Instead, the output of the current state is observable. The output symbol for each state is randomly chosen from a finite output alphabet according to some probability distribution.

[0138] Ohler et al (1999) used three interpolated Markov chains of different order, which are trained on coding, non-coding and promoter sequences.

[0139] A Generalized Hidden Markov Model (GHMM) generalizes the HMM as follows: in a GHMM, the output of a state may not be a single symbol. Instead, the output may be a string of finite length. For a particular current state, the length of the output string as well as the output string itself might be randomly chosen according to some probability distribution. The probability distribution need not be the same for all states. For example, one state might use a weight matrix model for generating the output string, while another might use an HMM. Formally a GHMM is described by a set of five parameters:

[0140] i) A finite set Q of states.

[0141] ii) Initial state probability distribution Πq.

[0142] iii) Transition probabilities T_(i,j) for i,j∈Q.

[0143] iv) Length distributions f of the states (f_(q) is the length distribution for state q).

[0144] v) Probabilistic models for each of the states, according to which output strings are generated upon visiting a state.

[0145] The probabilistic model for gene structure as suggested by Burge and Karlin (1997) is based on a GHMM.

[0146] Another model is the probabilistic tree model. In the probabilistic tree model, similarly to the Markov models (MM), the probability of each symbol depends on its k predecessors. The difference between MM and probabilistic trees is that in MM k is fixed, while in the probabilistic trees k is variable. In practice, k is selected to attain the shortest contexts for which the conditional probability of a symbol given the context is practically equal to the conditional probability of that symbol given the whole data. For example, Bejerano and Yona (1998) used probabilistic suffix trees in order to model protein families. The construction of the suffix tree, its parameterization and growth are different from the tree model presented here (for example, the construction of a suffix tree requires multiple passes as opposed to a single pass in the context tree; moreover, 'partial leaves' that might have a vital importance for classification are ignored in suffix trees). Vert (2001) used a similar tree model for text clustering. The suggested invention is related to these methods yet differs from them as indicated below.

[0147] In the following, we indicate a list/survey of models that were investigated for several classification applications.

[0148] Proteins

[0149] Gelfand (1995) reviewed methods for prediction of functional sites, tRNA and protein coding genes. Fickett and Tung (1992) classified the protein coding algorithms into five groups:

[0150] i) Codon usage—using the frequencies of the 64 codons. See, for example, Staden and McLachlan (1982), Gribskov, Devereux and Burgess (1984), Hinds and Blake (1985), Kolaskar and Reddy (1985), Borodovsky et al (1986), Claverie and Bougueleret (1986), Fichant and Gautier (1987), Lapedes et al (1990).

[0151] ii) Encoded amino acid sequence. See, for example, McCaldon and Argos (1988), Tramontano and Macchiato (1986), Moody and Fristensky (1987).

[0152] iii) Base compositional bias between codon positions—these algorithms consider the differences between the three codon positions. See, for example, Shepherd (1981), Fickett (1982), Bibb, Findlay and Johnson (1984), Almagor (1985), Trifonov (1987).

[0153] iv) Imperfect periodicity in base occurrences. See, for example, Michel (1986), Silverman and Linsker (1986), Arques and Michel (1987), Konopka (1990).

[0154] v) Other global patterns. See, for example, Erickson and Altman (1979), Shulman, Steinberg and Westmoreland (1981), Blaisdell (1983).

[0155] Promoters

[0156] Ohler and Niemann (2001) reviewed the identification and analysis of eukaryotic promoters:

[0157] Discovering Motifs

[0158] Ohler and Niemann (2001) divided the motif discovery methods into two main categories: alignment methods and enumerative or exhaustive methods.

[0159] Alignment methods aim to identify unknown signals by a significant local multiple alignment of all sequences. Alignment approaches deliver a model of the motifs (such as a weight matrix) built from the alignment. They require different statistics depending on how often a pattern may be present in the sequences.

[0160] There are direct multiple alignment methods, such as the consensus algorithm, which aligns sequences one by one and optimizes the information content of the weight matrix constructed from the alignment.

[0161] There are also statistical approach methods. They consider the start positions of the motifs in the sequences to be unknown and perform a local optimization to determine which positions deliver the most conserved motif. Examples of these methods are: Gibbs sampling (Lawrence et al (1993)) and expectation maximization in the MEME system (Bailey and Elkan (1995)).

[0162] On the other hand, the enumerative or exhaustive methods aim to examine all oligomers of a certain length and report those that occur far more often than expected from the overall promoter sequence composition. These methods give a list of over-represented oligomers, possibly already grouped to form consensus sequences. They have to use an elaborate background model to judge the importance of frequent patterns. They also need to have the size of the motif specified in advance.

[0163] The Set of Input Data

[0164] The methods are often applied to a set of promoters that were first grouped together using gene expression measurements. A new way to look at the data is to cluster genes based on both expression levels and common motifs. An alternative approach is to identify elements by analyzing promoters of the same gene from approximately ten different related species.

[0165] Promoter Recognition Algorithms:

[0166] Ohler and Niemann (2001) divided these algorithms into two main groups based on different search principles:

[0167] i) Search-by-signal algorithms—making predictions based on the detection of core promoter elements such as the TATA box or the initiator, and/or transcription factor binding sites outside the core.

[0168] ii) Search-by-content algorithms—identifying regulatory regions based on the sequence composition of promoter and non-promoter examples.

[0169] There are also methods that combine both ideas—looking for signals and for regions of specific composition. Other methods and ideas for finding the promoters are:

[0170] i) Providing an accurate prediction of the TSS (transcription start site). This idea is good only for small regions known to contain a promoter (see Zhang, 1998).

[0171] ii) Providing specific prediction of regulatory regions using a search-by-content approach. The method gives no information regarding whether the affected gene is on the leading or lagging strand, or where the TSS itself is located within the region (see Scherf, 2000).

[0172] iii) Constructing specific, rather than general, promoter models for groups of genes, such as muscle-active genes known by experiment to contain specific combinations of regulatory elements (see Wasserman and Fickett, 1998).

[0173] Davuluri, Grosse and Zhang (2001) presented a set of discriminant functions that can recognize promoters in the human genome. They explain the implementation of these functions into a decision tree that constitutes a new program called FirstEF. They obtained TP=86% with FN=17%.

[0174] Fickett and Hatzigeorgiou (1997) provide a review of the eukaryotic promoter recognition methods:

[0175] Hsu et al (1994) and Wright et al (1991) used consensus sequences—giving the most preferred base at each position within a site. This approach loses much of the information and is of marginal utility.

[0176] A PWM (positional weight matrix) assigns a weight to each possible nucleotide at each position of a putative binding site and gives as a site score the sum of these weights. PWMs are more informative, and are used when enough information is available to build them.

[0177] Bucher (1990) developed an iterative algorithm for weight matrix refinement in order to find motifs in the promoters. This kind of model assumes a nonhomogeneous structure, which means that the symbol distribution differs between the positions in the pattern.

[0178] Interpolated HMM—In these techniques, the estimated probability of a sequence is the linear or other interpolation between all conditional probabilities with increasing context length. Ohler et al (1999) used three interpolated Markov chains of different order, which are mainly used to recognize eukaryotic promoters. They compared promoters versus non-promoters (coding sequences, intron sequences, and both coding and non-coding sequences). The best accuracy was achieved in promoters versus CDS (TN=95%, TP=88.9%, AC=91.95%).

[0179] Coding

[0180] Codon Frequencies in Coding Regions

[0181] An informative method to determine coding regions takes advantage of the frequencies at which the various codons occur in coding regions. For example, the amino acids Leucine, Alanine and Tryptophan are coded by 6, 4 and 1 different codons respectively. In a translation of a uniformly random DNA sequence, these amino acids should occur in the ratio 6:4:1, but in a protein they occur at a different ratio of 6.9:6.5:1. Therefore coding DNA is not random. Another example of the non-uniformity of coding DNA is the fact that A or T occurs in the 3rd position of a codon at a rate over 90% (these statistics vary for different species).

[0182] Finding Long ORFs

[0183] Another way to distinguish coding regions from non-coding regions is to examine the frequencies of stop codons. Assuming a uniform random distribution, a stop codon is expected to be observed every 64/3=21.33 codons (since there are 3 stop codons). Average proteins are much longer, being coded by about 1000 bp (base pairs). Each coding region has only one stop codon, which terminates the region. Therefore, one way to detect the coding regions is to look for long sequences of codons without any stop codon. The algorithm that uses the above idea scans the DNA sequence, looking for long ORFs in all three reading frames. Upon detecting a stop codon, the algorithm scans backward, searching for a start codon. This algorithm will fail to detect very short genes, as well as overlapping long ORFs on opposite strands. Moreover, there are many more ORFs than genes. For example, one can find 6500 ORFs in the DNA of the bacterium E. coli while there are only 4400 genes.
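By way of a non-limiting illustration, the long-ORF scan can be sketched in Python as follows. This sketch walks each of the three reading frames forward, recording stretches from a start codon (ATG) to the next in-frame stop codon; only the forward strand is scanned, and the minimum length is an assumed parameter.

```python
# Sketch: finding long open reading frames in three reading frames.
STOPS = {'TAA', 'TAG', 'TGA'}

def long_orfs(dna, min_codons=100):
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == 'ATG':
                start = i  # remember the in-frame start codon
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```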

[0184] ORFs as Markov Chains

[0185] Assuming one finds all ORFs in a sequence, one can use codon frequencies to find which ORFs are coding and which are non-coding open reading frames (NORFs). This is done by translating each ORF into a codon sequence and obtaining a 64-state Markov chain. One can use a state for each codon rather than a state for each amino acid, because codons are more informative than their translations (there might be a preference for a specific codon in gene expression over other codons that encode the same amino acid). The transition probabilities are the probabilities for each codon to follow any other codon in a coding region. Using this model, one can compute the probability that a given ORF is really a coding region.
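By way of a non-limiting illustration, such a codon Markov chain may be sketched in Python as follows; `coding_examples` is a hypothetical training set of known coding sequences, and the pseudocount and probability floor are assumptions made for the example.

```python
# Sketch: a 64-state codon Markov chain for scoring candidate ORFs.
import math
from collections import defaultdict

def codons(dna):
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

def train_codon_chain(coding_examples, pseudocount=0.1):
    counts = defaultdict(lambda: defaultdict(lambda: pseudocount))
    for seq in coding_examples:
        cs = codons(seq)
        for a, b in zip(cs, cs[1:]):
            counts[a][b] += 1
    # Normalize each row into transition probabilities.
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

def log_likelihood(orf, chain, floor=1e-6):
    cs = codons(orf)
    return sum(math.log(chain.get(a, {}).get(b, floor))
               for a, b in zip(cs, cs[1:]))
```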

[0186] Exons

[0187] Spliced Alignment

[0188] Given a genomic sequence and a set of candidate exons, the spliced alignment algorithm (see Gelfand, Mironov and Pevzner, 1996) explores all possible exon assemblies and finds a chain of exons which best fits a related target protein. The set of candidate exons is constructed by considering all blocks between candidate acceptor and donor sites (i.e., between the AG dinucleotide at the intron-exon boundary and the GU dinucleotide at the exon-intron boundary) and further filtration of this set. To avoid losing true exons, the filtration procedure is designed to be very gentle, and the resulting set of blocks may contain a large number of false exons. Instead of trying to identify the correct exons by further pursuit of statistical methods, the algorithm considers all possible chains of candidate exons and finds a chain with the maximum global similarity to the target protein.

[0189] Network Formulation

[0190] The spliced alignment problem can be formulated in network terms. The set of blocks is represented by a set of nodes; a node v_(i) represents block β_(i) and is connected to a node v_(j) if β_(i)<β_(j). The requested solution is the best alignment between the reference sequence (T) and a path in the network.

[0191] Further information is available from Ben-Gal I., Shmilovici A., Morag G., "Design of Control and Monitoring Rules for State Dependent Processes", Journal of Manufacturing Science and Production, 3, Nos. 2-4, 2000, pp. 85-93; also Ben-Gal I., Morag G., Shmilovici A., "Statistical Control of Production Processes via Context Monitoring of Buffer Levels", submitted (after revision); Ben-Gal I., Singer G., "Integrating Engineering Process Control and Statistical Process Control via Context Modeling", submitted (after revision); Shmilovici A., Ben-Gal I., "Context Dependent ARMA Modeling", Proc. of the 21st IEEE Convention, Tel-Aviv, Israel, Apr. 11-12, 2000, pp. 249-252; Morag G., Ben-Gal I., "Design of Control Charts Based on Context Universal Model", Proc. of the Industrial Engineering and Management Conference, Beer-Sheva, May 3-4, 2000, pp. 200-204; Zinger G., Ben-Gal I., "An Information Theoretic Approach to Statistical Process Control of Autocorrelated Data", Proc. of the Industrial Engineering and Management Conference, Beer-Sheva, May 3-4, 2000, pp. 194-199 (in Hebrew); Ben-Gal I., Shmilovici A., Morag G., "Design of Control and Monitoring Rules for State Dependent Processes", Proc. of the 2000 International CIRP Design Seminar, Haifa, Israel, May 16-18, 2000, pp. 405-410; Ben-Gal I., Shmilovici A., Morag G., "Statistical Control of Production Processes via Monitoring of Buffer Levels", Proc. of the 9th International Conference on Productivity & Quality Research, Jerusalem, Israel, Jun. 25-28, 2000, pp. 340-347; Shmilovici A., Ben-Gal I., "Statistical Process Control for a Context Dependent Process Model", Proc. of the Annual EURO Operations Research Conference, Budapest, Hungary, Jul. 16-19, 2000; Ben-Gal I., Shmilovici A., Morag G., "An Information Theoretic Approach for Adaptive Monitoring of Processes", ASI2000, Proc. of the Annual Conference of ICIMS-NOE and IIMB, Bordeaux, France, Sep. 18-20, 2000; Singer G. and Ben-Gal I., "A Methodology for Integrating Engineering Process Control and Statistical Process Control", Proc. of the 16th International Conference on Production Research, Prague, Czech Republic, 29 Jul.-Aug. 3, 2001; and Ben-Gal I., Shmilovici A., "Promoters Recognition by Varying-Length Markov Models", Artificial Intelligence and Heuristic Methods for Bioinformatics, 30 September-12 October, San-Miniato, Italy. The contents of each of the above documents are hereby incorporated by reference.

SUMMARY OF THE INVENTION

[0192] According to a generalized aspect of the present invention there is thus provided an algorithm which can analyze strings of consecutive symbols taken from a finite set. The symbols are viewed as observations taken from a stochastic source with unknown characteristics. Without a priori knowledge, the algorithm constructs probabilistic models that represent the classes, dynamics and interrelations within the data. It then monitors incoming data strings for compatibilities or incompatibilities with the models that were constructed. Compatibilities between the probabilistic model and the incoming strings are identified and analyzed to trigger appropriate actions such as correct classification. Incompatibilities between the probabilistic model and the incoming strings are identified and analyzed to trigger appropriate actions (application dependent).

[0193] According to a first aspect of the present invention there is provided apparatus for building a stochastic model of a data sequence, said data sequence comprising spatially related symbols selected from a finite symbol set, the apparatus comprising:

[0194] an input for receiving said data sequence,

[0195] a tree builder for expressing said symbols as a series of counters within nodes, each node having a counter for each symbol, each node having a position within said tree, said position expressing a symbol sequence and each counter indicating a number of its corresponding symbol which follows a symbol sequence of its respective node, and

[0196] a tree reducer for reducing said tree to an irreducible set of conditional probabilities of relationships between symbols in said input data sequence.

[0197] Preferably, the tree reducer comprises a tree pruner for removing from said tree any node whose counter values are within a threshold distance of counter values of a preceding node in said tree.

[0198] Preferably, the various tree construction parameters are user definable. Such tree construction parameters include the threshold distance. Preferably, said user selectable parameters further comprise a tree maximum depth.

[0199] Preferably, said tree construction parameters further comprise an algorithm buffer size.

[0200] Preferably, said tree construction parameters further comprise values of pruning constants.

[0201] Preferably, said tree construction parameters further comprise a length of input sequences.

[0202] Preferably, said tree construction parameters further comprise an order of input symbols.

[0203] Preferably, said tree reducer further comprises a path remover operable to remove any path within said tree that is a subset of another path within said tree.

[0204] Preferably, said sequential data is a string comprising consecutive symbols selected from a finite set.

[0205] The apparatus preferably further comprises an input string permutation unit for carrying out permutations and reorganizations of the input string using external information about a process generating said string.

[0206] Preferably, said string is a nucleic acid sequence.

[0207] Preferably, said string is a promoter and said tree is operable to identify other promoters.

[0208] Preferably, said string is a string of coding DNA, and said tree is operable to identify other coding strings.

[0209] Preferably, said string is a string of non-coding DNA, and said tree is operable to identify other non-coding strings.

[0210] Preferably, said string is a DNA string and said tree is operable to identify poly-A terminators.

[0211] Preferably, said string has a given property, and said tree is operable to identify other strings having said given property.

[0212] Preferably, said string is an amino-acid sequence and the symbols comprise at least some of the 20 amino-acids.

[0213] Preferably, said string is an amino acid string and said tree is operable to identify at least one of primary, secondary and three-dimensional protein structure.

[0215] Preferably, said nucleic acid sequence is a promoter sequence and another nucleic acid sequence is a non-promoter sequence, wherein said stochastic modeler is operable to build models of said promoter sequence and said non-promoter sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a promoter sequence or a non-promoter sequence.

[0216] Preferably, said nucleic acid sequence is a coding sequence and another nucleic acid sequence is a non-coding sequence, wherein said stochastic modeler is operable to build models of said coding sequence and said non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a coding sequence or a non-coding sequence.

[0217] Preferably, said nucleic acid sequence is a repetitive sequence and another nucleic acid sequence is a non-repetitive sequence, wherein said stochastic modeler is operable to build models of said repetitive sequence and said non-repetitive sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a repetitive sequence or a non-repetitive sequence.

[0218] Preferably, said nucleic acid sequence is a non-coding sequence and another nucleic acid sequence is a coding sequence, wherein said stochastic modeler is operable to build models of said non-coding sequence and said coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a non-coding sequence or a coding sequence.

[0219] Preferably, said nucleic acid sequence is an exon sequence, wherein said stochastic modeler is operable to build a model of said exon sequence and said comparator is operable to compare a second nucleic acid sequence with said model to determine whether said second sequence is an exon sequence.

[0220] Preferably, said nucleic acid sequence is an intron sequence, wherein said stochastic modeler is operable to build a model of said intron sequence and said comparator is operable to compare a second nucleic acid sequence with said model to determine whether said second sequence is an intron sequence.

[0221] Preferably, said stochastic model is refinable using further data sequences, thereby to define a structure of a common attribute of said data sequences.

[0222] According to a second aspect of the present invention there is provided apparatus for determining statistical consistency in spatially related data comprising a finite set of symbols, the apparatus comprising:

[0223] a sequence input for receiving said spatially related data,

[0224] a stochastic modeler for producing at least one stochastic model from at least part of said spatially related data,

[0225] and a comparator for comparing said stochastic model with a prestored model, thereby to determine whether there has been a statistical change in said data.

[0226] Preferably, said stochastic modeler comprises:

[0227] a tree builder for expressing said symbols as a series of counters within nodes, each node having a counter for each symbol, each node having a position within said tree, said position expressing a symbol sequence and each counter indicating a number of its corresponding symbol which follows a symbol sequence of its respective node, and

[0228] a tree reducer for reducing said tree to an irreducible set of conditional probabilities of relationships between symbols in said input data sequence.

[0229] Preferably, said prestored model is a model constructed using another part of said spatially related data.

[0230] Preferably, said comparator comprises a statistical processor for determining a statistical distance between said stochastic model and said prestored model.

[0231] Preferably, said comparator comprises a statistical processor for determining a difference in statistical likelihood between said stochastic model and said prestored model.

[0232] Preferably, said statistical distance is a relative complexity measure. The statistical distance may comprise an SPRT statistic, an MDL statistic, a multinomial goodness-of-fit statistic, a Weinberger statistic, a KL statistic, or any other suitable statistic.

[0233] Preferably, said tree reducer comprises a tree pruner for removing from said tree any node whose counter values are within a threshold distance of counter values of a preceding node in said tree.

[0234] Preferably, said threshold distance, and other tree construction parameters, are user selectable.

[0235] Preferably, tree construction parameters further comprise a tree maximum depth.

[0236] Preferably, tree construction parameters further comprise an algorithm buffer size.

[0237] Preferably, tree construction parameters further comprise values of pruning constants.

[0238] Preferably, user selectable parameters further comprise a length of input sequences.

[0239] Preferably, tree construction parameters further comprise an order of input symbols.

[0240] Preferably, said tree reducer further comprises a path remover operable to remove any path within said tree that is a subset of another path within said tree.

[0241] Preferably, said data comprises a nucleic acid sequence.

[0242] Preferably, said data comprises an amino-acid sequence.

[0243] Preferably, said sequential data is an output of a medical sensor sensing bodily functions.

[0244] Preferably, said nucleic acid sequence is a promoter sequence and another nucleic acid sequence is a non-promoter sequence, wherein said stochastic modeler is operable to build models of said promoter sequence and said non-promoter sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a promoter sequence or a non-promoter sequence.

[0245] Preferably, said nucleic acid sequence is a coding sequence and another nucleic acid sequence is a non-coding sequence, wherein said stochastic modeler is operable to build models of said coding sequence and said non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a coding sequence or a non-coding sequence.

[0246] Preferably, said nucleic acid sequence is a repetitive sequence and another nucleic acid sequence is a non-repetitive sequence, wherein said stochastic modeler is operable to build models of said repetitive sequence and said non-repetitive sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a repetitive sequence or a non-repetitive sequence.

[0247] Preferably, said nucleic acid sequence is a non-coding sequence and another nucleic acid sequence is a coding sequence, wherein said stochastic modeler is operable to build models of said non-coding sequence and said coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a non-coding sequence or a coding sequence.

[0248] Preferably, said data sequence comprises image data of a first image.

[0249] Preferably, said distance is indicative of a statistical distribution within said image.

[0250] Preferably, the apparatus further comprises an image comparator for comparing said statistical distribution with a statistical distribution of another image.

[0251] Preferably, the other image is of a same view as said first image taken at a different time, said distance being indicative of time-dependent change.

[0252] Preferably, said image data comprises medical imaging data, said statistical distance being indicative of deviations of said data from an expected norm.

[0253] The embodiments are preferably applicable to a database to perform data mining on said database.

[0254] Preferably, said stochastic model is constructed from descriptions of a plurality of enzymes for carrying out a given task, said model thereby providing a generic structural description of an enzyme for carrying out said task.

[0255] Preferably, the model is usable to analyze results of a nucleic acid microarray.

[0256] Preferably, the model is usable to analyze results of a protein microarray.

[0257] According to a third aspect of the present invention there is provided a method of designing a protein for carrying out a predetermined task, the method comprising:

[0258] taking a plurality of proteins known to carry out said predetermined task,

[0259] constructing a stochastic model using an amino acid sequence of said plurality of proteins,

[0260] using said stochastic model to predict a protein sequence.

[0261] According to a fourth aspect of the present invention there is provided a method of designing a protein for carrying out a predetermined task, the method comprising:

[0262] taking a plurality of proteins known to carry out said predetermined task,

[0263] constructing a stochastic model using the 3D structure of said plurality of proteins,

[0264] using said stochastic model to determine a protein structure.

[0265] According to a fifth aspect of the present invention there is provided a method of distinguishing between biological sequences of a first kind and biological sequences of a second kind, each kind being expressible in terms of a same finite set of symbols, the method comprising:

[0266] obtaining a statistically significant set of sequences of said first kind and building a stochastic model thereof,

[0267] obtaining a statistically significant set of sequences of said second kind and building a stochastic model thereof, and

[0268] taking a further sequence and comparing it with each stochastic model to determine whether it belongs to either set.

[0269] Preferably, said biological sequences are nucleic acid sequences.

[0270] Preferably, said biological sequences are amino acid sequences.

[0271] Preferably, the sequences of said first kind are promoter sequences.

[0272] Alternatively or additionally, the sequences of said first kind are coding sequences and the sequences of said second kind are non-coding sequences.

[0273] Preferably, the sequences are non-species specific, thereby constructing models which are non-species specific.

BRIEF DESCRIPTION OF THE DRAWINGS

[0274] For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings.

[0275] With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the accompanying drawings:

[0276] FIG. A1 is a tree diagram describing general methodologies for pattern classification,

[0277] FIG. A2 is a schematic diagram of a prokaryotic gene sequence,

[0278] FIG. A3 is a schematic diagram of a eukaryotic gene sequence,

[0279] FIG. 1 is a simplified diagram showing the interrelationships between different modeling and characterization methods related to the Statistical Process Control and Change Point areas. These areas are related to pattern classification and are relevant to the presented invention. The figure specifically shows where the present embodiments fit in with the prior art.

[0280] FIG. 2 is a block diagram of a device for monitoring an input sequence according to a first preferred embodiment of the present invention,

[0281] FIG. 3a is a context tree constructed from a simulator in accordance with an embodiment of the present invention,

[0282] FIG. 3b is a context tree for 238 E. coli promoters constructed in accordance with an embodiment of the present invention,

[0283] FIG. 4 is a simplified flow diagram showing a process of building an optimal context tree according to embodiments of the present invention,

[0284] FIG. 5 is a simplified flow diagram showing a process of monitoring using embodiments of the present invention,

[0285] FIG. 6A is a flow diagram showing the procedure for building up nodes of a context tree according to a preferred embodiment of the present invention,

[0286] FIG. 6B is a variation of the flow chart of FIG. 6A which carries out tree growth to reach a predetermined depth faster,

[0287] FIGS. 7-11 are simplified diagrams of context trees at various stages of their construction,

[0288] FIG. 12 is a state diagram of a stochastic process that can be monitored to demonstrate operation of the present embodiments, and

[0289] FIGS. 13-23 show various stages and graphs in modeling and attempting to control the process of FIG. 12 according to the prior art and according to the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0290] In the present embodiments, a model-generic pattern classification method and apparatus are introduced for the control of state-dependent data. The method is based on the context-tree model that was proposed by Rissanen (1983) for data-compression purposes and later used for prediction and identification (see Weinberger, Rissanen and Feder (1995)). The context-tree model comprises an irreducible set of conditional probabilities of output symbols given their contexts. It offers a way of constructing a simple and compact model of a sequence of symbols, including those describing complex, non-linear processes such as higher-order HMMs. An algorithm is used to construct a context tree, and preferably the algorithm generates a minimal tree, depending on its input parameters, that fits the gathered data.

[0291] The present embodiments are based on a modified context-tree that belongs to the above category. The suggested model differs from the models discussed in the background in:

[0292] i) its construction principles;

[0293] ii) its ability to find and compute what we call "partial contexts" and their probabilities, which were found to be of vital importance in biological classification applications; and

[0294] iii) the suggested distance measures between different trees.

[0295] In order to monitor the statistical attributes of a process, the first embodiments compare two context trees at any time during monitoring. For example, in certain types of analysis, in particular DNA analysis, it is possible to divide a sequence into subsequences and build trees for each. Thus, it is possible to compare several pairs of trees at once, with a monitor tree and a reference tree formed from monitor and reference data respectively for each pair. The comparison may be carried out by comparing the likelihood density of the given sequence according to each of the trees and then classifying the sequence based on a given threshold. The first context tree is a reference tree that belongs to a certain type of class, that is to say a model of how the classified data is expected to behave. There might be several context trees of this type, each belonging to a known class of sequences, such as coding, promoters etc. The second context tree is a monitored tree, generated periodically from a sequence of an unknown class, which needs to be classified. The tree parameters are often unknown and need to be estimated. A preferred embodiment uses maximum likelihood estimates and likelihood ratios (or log-likelihood ratios) to measure a relative 'distance' between these two trees with respect to a user-predetermined threshold. There are a number of statistics that can be used in addition to or as an alternative to likelihood ratios, as will be explained in greater detail below.

[0296] As will be explained below, in a first stage in certain of the embodiments, such as for DNA analysis, a string may be divided into a plurality of substrings for each of which a tree is built. Then several pairs of these trees are compared simultaneously, wherein one of the trees in each pair is a reference tree generated from a reference or training data set, and the monitored tree is generated from the monitored data set.

[0297] In DNA applications in general, in the first stage groups of strings are selected that share common properties or functionality, such as promoters/binding-sites/exons and introns/coding vs non-coding/amino-acids that have the same secondary (or higher) structure/proteins or enzymes that have a certain functionality, e.g., effects on patient health etc. The groups are taken from a training or learning set. From the training set a tree model is built for each group of strings. In the second stage, however, the tree model is generally used for RECOGNITION or PREDICTION over a "test set" of strings. Thus, it is possible to recognize whether a given string belongs to a group of promoters/coding DNA/non-coding DNA/a certain group of proteins with certain important properties/etc., or the model may try to predict certain properties of a given string (e.g., the secondary structure of a given sequence of amino acids). Usually this is done by computing the likelihood of that given string based on the tree models; thus, if (*) Pr{string|Tree No. 2}>Pr{string|Tree No. 1}>Pr{string|Tree No. 4}> . . . we can say that the string is most likely recognized as belonging to the group that is described by Tree 2, etc. In fact such a query is essentially Bayesian estimation: in the general case, if we have a priori knowledge regarding the distribution of the groups in the set, denoted by P{Tree Model}, then the likelihoods are computed by Bayes' theorem: Pr{Tree model|string}=P{string|Tree model}*P{Tree model}/P{string}. Sometimes, if a priori knowledge of the distribution of groups in the data is not available, we may assign a uniform probability to all P{Tree Model}, which is equivalent to using the simpler form (*) of the likelihood function. A sketch of this recognition step appears below.
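
The following is a minimal sketch of the recognition step, assuming each trained context tree exposes a hypothetical likelihood(string) method returning P{string|Tree model}; the method name and dictionary interface are illustrative, not part of the invention. With the default uniform priors, Bayes' rule reduces to comparing plain likelihoods, as in (*).

    def classify(string, trees, priors=None):
        """Return the label of the tree model with the highest posterior.

        trees:  dict mapping a class label to a trained context-tree model
        priors: dict mapping a class label to P{Tree Model} (uniform if None)
        """
        if priors is None:
            priors = {label: 1.0 / len(trees) for label in trees}
        # P{string} is the same normalizer for every class, so comparing
        # the numerators of Bayes' theorem is sufficient.
        posteriors = {label: tree.likelihood(string) * priors[label]
                      for label, tree in trees.items()}
        return max(posteriors, key=posteriors.get)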

[0298] In certain of the embodiments, similar models may differ based on certain changes of the model construction or the classification algorithm, such as: i) position dependence/inhomogeneous models; ii) mixed backward/forward algorithms; iii) permutation and reorganization of input strings, thus adding outside "information"; iv) type of levels, e.g., amino acids/nucleotides/proteins; v) divisions of substrings, etc. Some of these modified models are described below.

[0299] A preferred embodiment of the present invention, hereinafter Context-Tree Classification (CTC), has several particular advantages. Firstly, the embodiment learns the dynamics and the underlying distributions within a data string being monitored as part of model building. Such learning may be done without requiring a priori information, hence categorizing the embodiment as model-generic. Secondly, the embodiment extends the current limited scope of pattern classification applications to state-dependent classes with varying length order and partial leaves, as will be explained in more detail below. Thirdly, the embodiment provides convenient monitoring of discrete data.

[0300] A second embodiment uses the Kullback-Leibler (KL) statistic (see Kullback (1978)) to measure a relative distance between the two compared trees and derive an asymptotic distribution of the relative distance. Monitoring the KL estimates with respect to the asymptotic distribution indicates whether there has been any significant change of characteristics in the input data.

[0301] Other embodiments measure the stochastic complexity, or other statistical measures, to measure a relative distance between the two compared trees and derive an asymptotic distribution of the relative distance. Monitoring the analytic distribution of the stochastic complexity indicates whether there has been a significant change in the characteristics of the input data that requires a different classification. An advantage of the second embodiment over the first one is that it sometimes requires less monitored data in order to produce satisfactory results.

[0302] Other possible statistics that may be used to measure the distance between tree models include the following:

[0303] Wald's sequential probability ratio testing (SPRT): Wald's test is implemented both in conventional CUSUM and change-point methods and has analytical bounds developed based on the type-I and type-II errors. The advantages of this statistic are that one can detect the exact change point and apply a sequential sample size comparison between the reference and the monitored tree.

[0304] MDL (Minimum Description Length): The MDL is the shortest description of a given model and data string by the minimum number of bits needed to encode them. Such a measure may be used to test whether the reference 'in-control' context-tree and the monitored context-tree are from the same distribution (see Rissanen (1999)).

[0305] Multinomial goodness of fit tests: Several goodness of fit tests may be used for multinomial distributions. In general, they can be applied to tree monitoring since any context tree can be represented by a joint multinomial distribution of symbols and contexts. One of the most popular tests is the Kolmogorov-Smirnov (KS) goodness of fit test. Another important test that can be used for CTC is the Anderson-Darling (AD) test (Law and Kelton (1991)). This test is superior to the KS test for distributions that mainly differ in their tail (i.e., it provides a different weight for the tail). A sketch of such a goodness-of-fit check is given below.
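
As an illustration of the multinomial representation, the following sketch applies a chi-square goodness-of-fit test, a common multinomial alternative to the KS and AD tests named above, to the observed symbol-context counts of a monitored tree against reference joint probabilities. The count and probability dictionaries are assumed inputs, not structures defined by the invention, and both must range over the same (symbol, context) support.

    from scipy.stats import chisquare

    def gof_pvalue(counts, p0):
        """counts: observed n(x, s) for each (symbol, context) pair.
        p0:     reference joint probabilities P0(x, s) on the same support."""
        keys = sorted(counts)
        observed = [counts[k] for k in keys]
        total = sum(observed)
        expected = [total * p0[k] for k in keys]
        # A small p-value rejects the hypothesis that the monitored tree
        # was generated by the reference distribution.
        return chisquare(observed, expected).pvalue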

[0306] Weinberger's statistic: Weinberger et al. (1995) propose a measure to determine whether the context-tree constructed by the context algorithm is close enough to the "true" tree model (see eqs. (18), (19) in their paper). The advantage of such a measure is its similarity to the convergence formula (e.g., one can find bounds for this measure based on the convergence rate and a chosen string length N). However, the measure was suggested, and is most adequate, for coding purposes, since it assumes that the entire string N is not available.

[0307] Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

[0308] Reference is now made to FIG. 1, which is a chart showing a characterization of SPC methods. We note that the embodiment of context-tree classification (CTC) is related to the embodiment of context-based statistical process control (CSPC). In fact, both embodiments are based on the suggested context-tree model; however, each has a different area of application: CTC for pattern classification and CSPC for statistical process control. We hereafter use these terms interchangeably. FIG. 1 shows how the context-based SPC (CSPC) and, thus, the CTC embodiments of the present invention relate to existing SPC methods. As discussed above in the background, data sequences can be categorized into independent data and interrelated data, and each of these categories can make use of model-specific and model-generic methods. The embodiments of the present invention, denoted CSPC/CTC, are characterized as providing a model-generic method for interrelated data.

[0309] Reference is now made to FIG. 2, which is a simplified block diagram showing a generalized embodiment of the present invention. In the embodiment of FIG. 2, an input data sequence 10 arrives at an input buffer 12. A stochastic modeler 14 is able to use the data arriving in the buffer to build a statistical model or measure that characterizes the data. The building process and the form of the model will be explained in detail below.

[0310] The modeler 14 does not necessarily build models for all of the data. During the course of processing, it may build a single model or it may build successive models on successive parts of the data sequence. The model or models are stored in a memory 16. A comparator 18 comprises a statistical distance processor 20, which is able in one embodiment to make a statistical distance measurement between a model generated from current data and a prestored model. In a second embodiment the statistical distance processor 20 is able to measure the distance between two or more models generated from different parts of the same data. In a third embodiment, the statistical distance processor 20 is able to measure the distance between a pre-stored model and a data sequence. In any of these embodiments, the statistical measure is used by the comparator 18 to determine whether or not a statistically significant change in the data characteristics has occurred.

[0311] As will be described below, the comparator 18 may use the log-likelihood ratio or the KL statistical distance measure. KL is particularly suitable where the series is stationary and sufficiently long. Other measures are more appropriate where the series is space or otherwise dependent.

[0312] Reference is now made to FIG. 3, which is a simplified diagram showing a prior art model that can be used to represent statistical characteristics of data. In this section, we introduce the context-tree model for state-dependent data and the concepts of its construction algorithm, following the definitions and notations in Rissanen (1983), Weinberger, Rissanen and Feder (1995) and Ben-Gal et al. (2000, 2001). A detailed walk-through example presenting the context-tree construction is given in FIGS. 7-11 and Tables A1 and A2.

[0313] Consider a sequence (string) of observations x^(N)=x₁, . . . , x_(N), with elements x_(t), t=1, . . . , N, defined over a finite symbol set, X, of size d. In practice, this string can represent a realization sequence of a discrete variable drawn from a finite set. Particularly, the discrete variable can be a queue length in a queuing system, such as the number of parts in a buffer in a production line. For a finite buffer capacity c, the 'finite symbol set' (of possible buffer levels) is X={0,1,2, . . . , c} and d, the symbol-set size, is thus equal to d=c+1. For instance, the string x⁶=1,0,1,2,3,3 represents a sequence of six consecutive observations of the buffer level (number of parts) in a production line with buffer capacity of c=3.

[0314] A family of probability measures P_(N)(x^(N)), N=0,1, . . . is defined over the set {X^(N)} of all stationary sequences of length N, such that the marginality condition

$\sum_{x \in X} P_{N+1}(x^{N} x) = P_{N}(x^{N}) \qquad (2.1)$

[0315] holds for all N; x^(N)x=x₁, . . . , x_(N), x; and P₀(x⁰)=1, where x⁰ is the empty string. For simplification of notation, the sub-index will be omitted, so that P_(N)(x^(N))≡P(x^(N)).

[0316] One could opt to find a model that assigns the probability measure (2.1). A possible finite-memory source model of the sequences defined above is the Finite State Machine (FSM), which assigns a probability to an observation in the string based on a finite set of states. Hence, the FSM is characterized by the transition function, which defines the state for the next symbol,

$s(x^{N+1}) = f(s(x^{N}), x_{N+1}) \qquad (2.2)$

[0317] where s(x^(N))∈Γ are the states with a finite state space |Γ|=S; s(x⁰)=s₀ is the initial state; and f:Γ×X→Γ is the state transition map of the machine. The FSM is then defined by S·(d−1) conditional probabilities, the initial state s₀, and the transition function. The set of states of an FSM should satisfy the requirement that the conditional probability to obtain a symbol given the whole sequence is equal to the conditional probability to obtain the symbol given the past state, implying that

$P(x \mid x^{N}) = P(x \mid s(x^{N})) \qquad (2.3)$

[0318] A special case of FSM is the Markov process. The Markov process satisfies (2.2) and is distinguished by the property that for a kth-order Markov process s(x^(N))=x_(N), . . . , x_(N−k+1). Thus, reversed strings of a fixed length k act as source states. This means that the conditional probabilities of a symbol given all past observations (2.3) depend only on a fixed number of observations k, which defines the order of the process. However, even when k is small, the requirement for a fixed order can result in an inefficient estimation of the probability parameters, since some of the states often depend on shorter strings than the process order. On the other hand, increasing the Markov order to find a best fit results in an exponential growth of the number of states, S=d^(k), and, consequently, of the number of conditional probabilities to be estimated.

[0319] An alternative model to the Markovian is the context-tree that was suggested by Rissanen (1983) for data compression purposes and modified later in Weinberger, Rissanen and Feder (1995). The tree presentation of a finite-memory source is advantageous since states are defined as contexts (graphically represented by branches of variable length in the context-tree) and hence requires less estimation effort than is required for a Markov presentation. The context-tree is an irreducible set of conditional probabilities of output symbols given their contexts. The tree is conveniently estimated by the context algorithm. The algorithm generates an asymptotically minimal tree fitting the data (Weinberger, Rissanen and Feder (1995)). The attributes of the context-tree, along with the ease of its estimation, make it suitable for a model-generic classifier, as seen later.

[0320] A context, s(x^(t)), in which the "next" symbol in the string, x_(t+1), occurs is defined as the reversed string (we use the same notation for contexts as for the FSM states, since here they follow similar properties),

$s(x^{t}) = x_{t}, \ldots, x_{\max\{0,\, t-k+1\}} \qquad (2.4)$

[0321] for some k≧0, not necessarily the same for all strings (the case k=0 is interpreted as the empty string s₀). The string is truncated since the symbols observed prior to x_(t−k+1) do not affect the occurrence probability of x_(t+1). For the set of optimal contexts, Γ={s: shortest contexts satisfying (2.3)}, k is selected to attain the shortest contexts for which the conditional probability of a symbol given the context is practically equal to the conditional probability of that symbol given the whole data, i.e., nearly satisfying (2.3). Thus, an optimal context, s∈Γ, acts as a state of the context-tree, and is similar to a state in a regular Markov model of order k. However, unlike the Markov model, the lengths of the various contexts do not have to be equal and one does not need to fix k such that it accounts for the maximum context length. The variable context lengths in the context-tree model result in fewer parameters that have to be estimated and, consequently, require less data to identify the source. It is noted, however, that the optimal contexts model does not necessarily satisfy equation (2.2), since the new state s(x^(N+1)) can be longer than s(x^(N)) by more than one symbol (see Weinberger, Rissanen and Feder (1995)).

[0322] Using the above definitions, a description of the context-tree follows. A context-tree is an irreducible set of probabilities that fits the symbol sequence x^(N) generated by a finite-memory source. The tree assigns a distinguished optimal context for each element in the string, and defines the probability of the element, x_(t), given its optimal context. These probabilities are used later for classification and identification, comparing between sequences of observations and identifying whether they belong to the same class. Graphically, the context-tree is a d-ary tree which is not necessarily complete and balanced. Its branches (arcs) are labeled by the different symbol types. Each node contains a vector of d conditional probabilities of all symbols x∈X given the respective context (not necessarily optimal), which is represented by the path from the root to that specific node. An optimal context s∈Γ of an observation x_(t) is represented by the path starting at the root, with branch x_(t) followed by branch x_(t−1) and so on, until it reaches a leaf or a partial leaf.

[0323] FIG. 3 exemplifies a context-tree that was constructed from a sequence of observed buffer levels in a production line. Since in this case the buffer has a finite capacity of c=2, there are d=3 symbol types, where each observation x_(t)∈{0,1,2} refers to the number of parts in the buffer at time t. Using the context algorithm, S=5 optimal contexts are found (marked by a bold frame); thus, the set of optimal contexts is a collection of reversed strings Γ={0,2,102,1010,10101} (read from left to right). The context 1010 is a partial leaf.

[0324] Consider the string x⁶=1,2,0,1,0,1, which is generated from the tree source in FIG. 3. Employing the above definitions, the optimal context of the next element, x₇=0, is s(x⁶)=1,0,1,0, i.e., found by following the reversed string from the root until reaching an optimal context. Accordingly, the probability of x₇ given the context is P(x₇=0|s(x⁶))=0.33. Note that had we used a Markov chain model with maximal dependency order, which is k=5 (the longest branch in the tree), we would need to estimate the parameters of 3⁵=243 states (instead of the five optimal contexts in the context-tree of FIG. 3), although most of them are redundant.

[0325] The conditional probabilities of symbols given the optimal contexts, P(x|s), x∈X, s∈Γ, and the marginal probabilities of optimal contexts, P(s), s∈Γ, are estimated by the context algorithm. The joint probabilities of symbols and optimal contexts, P(x,s), x∈X, s∈Γ, represent the context-tree model and are used to derive the classifying algorithm. This model might be only an approximated description of the real generating source, but it is often appropriate for practical purposes.

[0326] Reference is now made to FIG. 4, which is a simplified schematic diagram showing stages of an algorithm for producing a context tree according to a first embodiment of the present invention. The construction algorithm of FIG. 4 is an extension of the context algorithm given in Weinberger, Rissanen and Feder (1995). The algorithm preferably constructs a context-tree from a string of N symbols and estimates the marginal probabilities of contexts and the conditional probabilities of symbols given contexts. The algorithm comprises five stages as follows: two concomitant stages of tree growing 42 and iterative counter updating and tree pruning 46; a stage of optimal context identification 48; and a stage of estimating the context-tree probability parameters 50.

[0327] In the tree growing stage 42, a counter context-tree, T_(t), 0≦t≦N, is grown up to a maximum depth m. Each node in T_(t) contains d counters, one for each symbol type. The counters, n(x|s), denote the conditional frequencies of the symbols x∈X in the string x^(t) given the contexts. Concomitantly with the tree growing stage 42, the counter updating and tree pruning stage 46 ensures that the counter values n(x|s) are updated according to symbol occurrences, as will be explained in more detail hereinbelow. The counter context tree is iteratively pruned along with counter updating to acquire the shortest reversed strings, thereby in practical terms satisfying equation (2.3), it being noted that exact equality is not achieved. In the following stage, selection of optimal contexts 48, a set of optimal contexts Γ is obtained, based on the pruned counter context tree. In the estimation stage 50, the estimated conditional probabilities of symbols given optimal contexts, P̂(x|s), x∈X, s∈Γ, and the estimated marginal probabilities of optimal contexts, P̂(s), s∈Γ, are derived. As discussed in more detail hereinbelow, both P̂(x|s) and P̂(s) are approximately multinomially distributed and are used to obtain the CSPC control limits. The estimated joint probabilities of symbols and optimal contexts, P̂(x,s)=P̂(x|s)·P̂(s), x∈X, s∈Γ, are then derived and represent the context-tree in its final form. It is noted that the term "counter context-tree" is used to refer to the model as it results from the first three stages of the algorithm, and the term "context-tree" is used to refer to the result of the final stage, which tree contains the final set of optimal contexts and estimated probabilities.

[0328] Returning now to FIG. 2, once a model is obtained for incoming data, the model is compared by comparator 18 with a reference model, or more than one reference model, which may be a model of earlier received data such as training data, or may be an a priori estimate of statistics for the data type in question, or the like.

[0329] In the following, examples are given based on measurements using the KL statistic. However, the skilled person will appreciate that other statistical measures may be used, including but not restricted to those mentioned hereinabove.

[0330] Kullback (1978), the contents of which are hereby incorporated by reference, proposed a measure for the relative 'distance' or the discrimination between two probability mass functions Q(x) and Q₀(x):

$K(Q(x), Q_{0}(x)) = \sum_{x \in X} Q(x) \log \frac{Q(x)}{Q_{0}(x)} \geq 0 \qquad (3.1)$

[0331] The measure, now known as the Kullback-Leibler (KL) measure, is positive for all non-identical pairs of distributions and equals zero iff (if and only if) Q(x)=Q₀(x) for every x. The KL measure is a convex function in the pair (Q(x),Q₀(x)), and invariant under all one-to-one transformations of the data. Kullback has shown that the KL distance (multiplied by a constant) between a d-category multinomial distribution Q(x) and its estimated distribution Q̂(x) is asymptotically chi-square distributed with d−1 degrees of freedom:

$2N \cdot K(\hat{Q}(x), Q(x)) \rightarrow \sum_{x \in X} \frac{(n(x) - N\,Q(x))^{2}}{N\,Q(x)} \sim \chi_{d-1}^{2},$

[0332] where N is the size of a sample taken from the population specified by Q(x); n(x) is the frequency of category (symbol type) x in the sample,

$\sum_{x \in X} n(x) = N;$

[0333] and Q̂(x)=n(x)/N is the estimated probability of category (symbol type) x.

[0334] The KL measure for the relative 'distance' between two joint probability mass functions Q(x,y) and Q₀(x,y) can be partitioned into two terms, one representing the distance between the conditioning random variable and the other representing the distance between the conditioned random variable:

$K(Q(x,y), Q_{0}(x,y)) = \sum_{y \in S} Q(y) \log \frac{Q(y)}{Q_{0}(y)} + \sum_{y \in S} Q(y) \sum_{x \in X} Q(x \mid y) \log \frac{Q(x \mid y)}{Q_{0}(x \mid y)} \qquad (3.3)$

[0335] In the present embodiments the comparator 18 preferably utilizes the KL measure to determine a relative distance between two context-trees. The first tree, denoted by P̂_(i)(x,s), represents the monitored distribution of symbols and contexts, as estimated from a string of length N at the monitoring time i=1,2, . . . The second tree, denoted by P₀(x,s), represents the 'in-control' reference distribution of symbols and contexts. The reference distribution is either known a priori or can be effectively estimated by the context algorithm from a long string of observed symbols, as will be discussed in greater detail below. In the latter case, the number of degrees of freedom is doubled.

[0336] Utilizing what is known as the minimum discrimination information (MDI) principle (see Alwan, Ebrahimi and Soofi (1998), the contents of which are herein incorporated by reference), the context algorithm preferably generates a tree of the data being monitored, the tree having a similar structure to that of the reference tree. Maintaining the same structure for the current data tree and the reference tree permits direct utilization of the KL measure.

[0337] Now, new observations are constantly being collected and may be used for updating the current data tree, in particular the counters thereof, and thus updating the statistics represented by the tree. A significant change in the structure of the tree may be manifested in the tree counters and the resulting probabilities.

[0338] Using equation (3.3) above, it is possible to decompose the KL measured distance between the current data context-tree and the reference context-tree (both represented by the joint distributions of symbols and contexts) into a summation involving two terms as follows:

$K(\hat{P}_{i}(x,s), P_{0}(x,s)) = \sum_{s \in \Gamma} \hat{P}_{i}(s) \log \frac{\hat{P}_{i}(s)}{P_{0}(s)} + \sum_{s \in \Gamma} \hat{P}_{i}(s) \sum_{x \in X} \hat{P}_{i}(x \mid s) \log \frac{\hat{P}_{i}(x \mid s)}{P_{0}(x \mid s)} \qquad (3.4)$

[0339] Of the two terms being summed, one measures the KL distance between the trees' context probabilities, and the other measures the KL distance between the trees' conditional probabilities of symbols given contexts.

[0340] Under the null hypothesis that the monitored tree P̂_(i)(x,s) is generated from the same source that generated P₀(x,s), and by using the multinomial approximation referred to above, it is possible to derive an asymptotic probability density function of the KL measure between P̂_(i)(x,s) and P₀(x,s), i.e.,

$K(\hat{P}_{i}(x,s), P_{0}(x,s)) \rightarrow \frac{1}{2N}\chi_{S-1}^{2} + \sum_{s \in \Gamma} \hat{P}_{i}(s) \cdot \frac{1}{2n(s)}\chi_{d-1}^{2} = \frac{1}{2N}\chi_{S-1}^{2} + \sum_{s \in \Gamma} \frac{n(s)}{N} \cdot \frac{1}{2n(s)}\chi_{d-1}^{2} = \frac{1}{2N}\chi_{S-1}^{2} + \frac{1}{2N}\sum_{s \in \Gamma} \chi_{d-1}^{2} = \frac{1}{2N}\left(\chi_{S-1}^{2} + \chi_{S(d-1)}^{2}\right) = \frac{1}{2N}\chi_{Sd-1}^{2}, \qquad (3.5)$

[0341] where n(s) is the frequency of an optimal context s∈Γ in the string; N is the size of the monitored string; S is the number of optimal contexts; and d is the size of the symbol set. As mentioned above, if the reference tree has to be estimated, the number of degrees of freedom may be doubled. Thus, the KL statistic for the joint distribution of the pair (X,Γ) is asymptotically chi-square distributed with degrees of freedom depending on the number of symbol types and the number of optimal contexts. The result is of significance for the development of control charts for state-dependent discrete data streams based on the context-tree model.

[0342] Now, given a type I error probability α, the control limits for the KL statistic are given by

$0 \leq 2N \cdot K(\hat{P}_{i}(x,s), P_{0}(x,s)) \leq \chi_{Sd-1,\,1-\alpha}^{2} \qquad (3.6)$

[0343] Thus, the upper control limit (UCL) is the 100(1−α) percentile of the chi-square distribution with (Sd−1) degrees of freedom.
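
The decision rule (3.6) can be sketched directly, assuming both trees have been reduced to dictionaries mapping each (symbol, context) pair to its joint probability over an identical support, with strictly positive reference values (as the predictive estimator below guarantees). The α default of 0.0027, matching traditional 3-sigma charts, is our illustrative choice, not a value fixed by the invention.

    import math
    from scipy.stats import chi2

    def kl(p_hat, p0):
        """KL distance K(p_hat, p0) over a shared (symbol, context) support."""
        return sum(p * math.log(p / p0[key])
                   for key, p in p_hat.items() if p > 0)

    def out_of_control(p_hat, p0, N, S, d, alpha=0.0027):
        """True if 2N*K exceeds the 100(1-alpha) percentile of chi2(S*d-1),
        i.e., the UCL of (3.6)."""
        ucl = chi2.ppf(1 - alpha, S * d - 1)
        return 2 * N * kl(p_hat, p0) > ucl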

[0344] The control limit (3.6) has the following advantageous characteristics:

[0345] i) It is a one-sided bound; if the KL value is larger than the UCL, the process is assumed to be 'out-of-control' for a given level of significance.

[0346] ii) The control limit lumps together all the parameters of the context-tree, in contrast with traditional SPC where each process parameter is controlled separately. Nevertheless, the KL statistic of the tree can be easily decomposed to monitor each node in the context-tree separately. This can be beneficial when looking for a cause of an 'out-of-control' signal.

[0347] iii) If Sd is large enough, the KL statistic is approximately normally distributed. Hence, conventional SPC charts can be directly applied to monitor the proposed statistic.

[0348] A basic condition for applying the KL statistic to sample data requires that P₀(x|s)>0, ∀x∈X, ∀s∈Γ. Such a constraint may be satisfied with the predictive approach, i.e.,

$\hat{P}(x \mid s) = \frac{n(x \mid s) - \sum_{b \in X} n(x \mid sb) + 1/2}{n(s) + d/2}, \quad \forall x, b \in X,\ s \in \Gamma$

[0349] where all probability values assigned to any of the symbol types are strictly positive, in contrast to the non-predictive approach:

$\hat{P}(x \mid s) = \frac{n(x \mid s) - \sum_{b \in X} n(x \mid sb)}{n(s)}, \quad \forall x, b \in X,\ s \in \Gamma$

[0350] by defining $\frac{0}{0} \equiv 0.$

[0351] The choice between these alternative procedures depends both on the knowledge regarding the system states and on the length of the string used to construct the context-tree. However, in the latter, non-predictive, case, the number of degrees of freedom is adapted according to the number of categories that are not equal to zero, thus subtracting the zero-probability categories when using the non-predictive approach. Both estimators are sketched below.
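
The following sketch contrasts the two estimators under assumed inputs: n_xs[x] is the counter n(x|s) at context s, n_xsb[x] sums the counts n(x|sb) over all one-symbol extensions b of s, and n_s is n(s). These variable names are ours, chosen for illustration.

    def predictive(n_xs, n_xsb, n_s, d):
        """Predictive estimator: strictly positive for every symbol,
        so the condition P0(x|s) > 0 is always met."""
        return {x: (n_xs[x] - n_xsb.get(x, 0) + 0.5) / (n_s + d / 2)
                for x in n_xs}

    def non_predictive(n_xs, n_xsb, n_s):
        """Non-predictive estimator; 0/0 is defined as 0, and zero
        categories reduce the degrees of freedom accordingly."""
        return {x: 0.0 if n_s == 0 else (n_xs[x] - n_xsb.get(x, 0)) / n_s
                for x in n_xs}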

[0352] Another example of a distance measure between two tree models is the use of log-likelihood ratios. This example considers an application of pattern recognition of E. coli promoters. The details of the experiment are as follows.

[0353] The 238 DNA strings of size 12 from a given database were converted to strings of numbers, encoded such that: "A"=1; "C"=2; "G"=3; "T"=4. A special version of the context-tree construction algorithm, "cont12", was adapted for DNA sequences of size 12. It was adapted so that the tree construction uses only contexts of length up to 5, and the context buffer is reset every time it reaches size eleven. Thus, effectively, each appearance of a size-12 promoter updates the statistics of the context tree, and grows a tree with up to a depth of 5 levels (maximum context length of 5). The tree construction parameters were not optimized, since the experiment was conducted mainly for illustration purposes.
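
The numeric encoding used in the experiment is simple enough to state directly; the function name below is illustrative.

    # "A"=1; "C"=2; "G"=3; "T"=4, as in the experiment above.
    ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

    def encode(dna: str) -> list:
        return [ENCODING[base] for base in dna]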

[0354] Reference is now made to FIG. 3b. The 238 number strings were concatenated into one large string "s". A 5-level context tree was identified from the 238 promoter DNA strings by using default tree parameters. The likelihood of subsequences given the context trees can be computed for promoter and non-promoter sequences. For example, the probability of the string GCTTA, according to the context tree in FIG. 3b, is calculated by parsing the string into identified contexts: P{GCTTA}=P(G)*P(C|G)*P(T|GC)*P(T|GCT)*P(A|GCTT)=(see the context tree in FIG. 3b)=P(G)*P(C|G)*P(T)*P(T|T)*P(A|CTT)=0.4136*0.2164*0.3988*0.3589*0.5385.
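
A minimal sketch of this parsing, under an assumed storage convention: the tree is a dict keyed by contexts written as reversed histories (the root being the empty string), each mapping to a dict of symbol probabilities. The deepest stored context matching the reversed history is used, which is how P(T|GCT) collapses to P(T|T) in the calculation above.

    def longest_context(tree, history):
        """Deepest stored context matching the reversed history."""
        ctx = ""
        for sym in reversed(history):
            if ctx + sym not in tree:  # assumed key convention: reversed strings
                break
            ctx += sym
        return ctx

    def string_probability(tree, string):
        """P{string} as the product of P(symbol | matched context)."""
        p = 1.0
        for t, sym in enumerate(string):
            ctx = longest_context(tree, string[:t])
            p *= tree[ctx][sym]
        return p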

[0355] As explained above, the distance measure between two models for a given string is obtained by computing the likelihood of that given string based on the tree models; thus, if (*) Pr{string|Tree No. 2}>Pr{string|Tree No. 1}>Pr{string|Tree No. 4}> . . . one can say that the string is most likely recognized as belonging to the group that is described by Tree 2, etc.

[0356] Reference is now made to FIG. 5A, which is a simplified flow chart illustrating the control procedure used by the device of FIG. 2 to control a process or the like. In FIG. 5A, a first stage 60 comprises obtaining a reference context-tree, P₀(x,s). This may be done analytically, or by employing the context algorithm on a long string of representative data, for example from a training set, or the model may be obtained from an external source.

[0357] In a second stage 62, a data source is monitored by obtaining data from the source. At succeeding points, a data sample is used to generate a current data tree P̂_(i)(x,s) from a sample of sequenced observations of size N. The sample size preferably complies with certain conditions that will be discussed in detail hereinbelow. The sequences can be mutually exclusive, or they can share some data points (often this is referred to as "sliding window" monitoring). The order of the sequence can be reorganized or permuted in various ways to comply with time-dependent constraints or another type of side-information which is available. Each sequence used to generate a model is referred to herein below as a "run" and contributes a monitoring point in the CSPC chart or a classification decision to CTC. Following the MDI principle referred to above, the structure of the current data tree is selected to correspond to the structure of the reference context-tree. Once the structure of the tree has been selected, then, in a model building stage 64, the counters of the current data context tree are updated using values of the string, and probability measures of the monitored context-tree are obtained, as will be explained in greater detail below.
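
Run generation is straightforward to sketch, assuming the symbol stream is held in memory; step == N yields mutually exclusive runs, while step < N yields the overlapping "sliding window" runs mentioned above.

    def runs(stream, N, step):
        """Yield successive samples of size N; each run becomes one
        monitoring point (CSPC) or one classification decision (CTC)."""
        for start in range(0, len(stream) - N + 1, step):
            yield stream[start:start + N]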

[0358] Once the model has been built, it may be compared with the reference model, and thus the (log-)likelihood ratio or the KL value can be calculated to give a distance between the two models in a step 66. As mentioned above, the log-likelihood ratio is compared to a user predefined value. The KL value measures a relative distance between the monitored distributions P̂_(i)(x,s) of the current model and the reference distributions P̂₀(x,s) as defined in the reference model. In some cases it might be valuable to use several distance measures simultaneously and to interpolate or average their outcomes.

[0359] Referring to the CTC, the query step 68 indicates whether the obtained statistic value points to one of the classes. When considering the CSPC, the KL statistic value can be plotted on a process control chart against process control limits in a query step 68. The control limits may, for example, comprise the upper control limit (UCL) given in equation (3.6) above. If the KL value is larger than the UCL, it indicates that a significant change may have occurred in the process, and preferably an alarm is set.

[0360] The process now returns to step 62 to obtain a new run of data, and the classifying process is repeated until the end of the process.

[0361] Considering FIG. 5A in greater detail, the data sample obtained at stage 62 may be considered as a sequence of observations x^(N)=x₁, . . . , x_(N), with elements x_(t), t=1, . . . , N, defined over a finite symbol set, X, of size d. In stage 64, a primary output is a context tree T_(N) for the sample, which context tree contains optimal contexts and the conditional probabilities of symbols given the optimal contexts. Namely, it is a model of the incoming data, incorporating patterns in the incoming data and allowing the probability of a likely next symbol given a current symbol to be calculated.

[0362] Reference is now made to FIG. 5B, which is a simplified flow chart showing how the same measurement may be carried out using stochastic complexity. A reference tree is initially obtained in step 60A. Then a data sample is obtained in step 62A. Stochastic complexity is calculated in step 64A and control limits are calculated in step 66A. Finally, the sample values are tested in step 68A to determine how to classify the sequence or whether the stochastic complexity is within the control limits.

[0363] Reference is now made to FIG. 6A, which is a simplified flow chart showing an algorithm for carrying out stage 64 in FIG. 5, namely building of a context tree model based on the sample gathered in step 62. More specifically, FIG. 6A corresponds to stages 42 and 46 in FIG. 4. The tree growing algorithm of FIG. 6A constructs the tree according to the following rules (the algorithm depends on parameters that can be modified and optimized by the user):

[0364] A stage S1 takes a single root as the initial tree, T₀, and all symbol counts are set to zero. Likewise a symbol counter t is set to 0.

[0365] A stage S2 reads the (t+1)th symbol from the input sequence; thus, in the first iteration, x_(t+1)=x₁ is being read.

[0366] In stage S3 the algorithm begins a process of tracing back from a root node to the deepest node in the tree. If i=0 then tracing remains within the same root; otherwise the algorithm chooses the branch of T_(t) representing symbol x_(t−i+1). Stage S4 is part of the traceback process of step S3. As each node in the tree is passed, the counter at that node corresponding to the current symbol is incremented by one.

[0367] In step S5, the process determines whether it has reached a leaf, i.e., a node with no descendant nodes. If so, the process continues with S6; otherwise it returns to S3.

[0368] S6 controls the creation of new nodes. S6 checks that the last updated counter is at least one and that further descendant nodes can be opened. It will, for example, detect a counter set to zero in step S8. Preferably, the last updated count is at least 1, i≦m (m being the maximum depth) and t−i≧0. Step S7 creates a new node corresponding to x_(t−i+1). Step S8 generates one counter with value 1 and the other counters with zero value at new node creation. Those values may be detected by S6 when another symbol is read. Step S10 controls the retracing procedure needed to stimulate tree growth to its maximal size, by testing i≦m and t−i≧0 and branching accordingly.

[0369] Once a leaf has been reached in S5 or S10, the traceback procedure is complete for the current symbol. The maximal allowed deepest node is set at an arbitrary limit (e.g. 5) to limit tree growth and size, and to save computation and memory space. Without such a limit there would be a tendency to grow the tree to a point beyond which it is very likely to be pruned in any case.

[0370] Stage S6 thus checks whether the last updated count is at least one and whether the maximum depth has yet been reached. If the result of the check is true, then a new node is created in step S7, to which symbol counts are assigned, all being set in step S8 to zero except for that corresponding to the current symbol, whose counter is set to 1. The above procedure is preferably repeated until a maximum depth is reached or a context x_(t)x_(t−1) . . . x₁ is reached in stage S10. Thereafter the next symbol is considered in stage S2.

[0371] More specifically, having recursively constructed an initial tree T_(t) from an initial symbol or string x^(t), the algorithm moves ahead to consider the next symbol x_(t+1). Then tracing back is carried out along a path defined by x_(t), x_(t−1), . . . and in each node visited along the path the counter value of the symbol x_(t+1) is incremented by 1, until the tree's current deepest node, say x_(t), . . . , x_(t−l+1), is reached. Although not shown in FIGS. 6A and 6B, an initial string preceding x^(t) may be used in order to account for initial conditions (see Rissanen 1983, the contents of which are hereby incorporated by reference).

[0372] If the last updated count is at least 1, and l<m, where m is the maximum depth, the algorithm creates new nodes corresponding to x_(t−r), l<r≦m, as descendant nodes of the node defined in S6, r being the depth of the new deepest node reached for the current path after completing stage S7. Each new node is assigned a full set of counters, which are initialized to zero except for the one counter corresponding to the current symbol x_(t+1), which is set to 1. Retracing is continued until the entire past symbol history of the current input string has been mapped to a path for the current symbol x_(t+1), or until m is reached.
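
The growing pass of FIG. 6A may be condensed into the following illustrative Python sketch, which folds stages S2 to S10 into one backward walk per symbol; the Node class and function names are ours, and the sketch omits the initial-conditions string mentioned above:

```python
class Node:
    """A context-tree node: one counter per symbol and one child per
    possible preceding symbol."""
    def __init__(self, symbols):
        self.counts = {x: 0 for x in symbols}
        self.children = {}

def grow_tree(sequence, symbols, max_depth=5):
    """One pass of the tree growing algorithm of FIG. 6A: for each
    symbol x_{t+1}, trace back along x_t, x_{t-1}, ..., incrementing
    the x_{t+1} counter in every node on the path (stage S4) and
    opening new nodes (stages S7/S8) up to max_depth (m)."""
    root = Node(symbols)
    for t, current in enumerate(sequence):
        node = root
        node.counts[current] += 1               # count at the root
        for past in reversed(sequence[max(0, t - max_depth):t]):
            if past not in node.children:       # leaf reached: open node
                node.children[past] = Node(symbols)
            node = node.children[past]
            node.counts[current] += 1           # count along the path
    return root
```

Running grow_tree("444332", "01234"), for example, reproduces the counters of Table A1 below.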

[0373] Reference is now made to FIG. 6B, which is a simplified flow chart showing a variation of the method of FIG. 6A. Steps that are the same as those in FIG. 6A are given the same reference numerals and are not referred to again except as necessary for understanding the present embodiment.

[0374] In FIG. 6B, steps S9 and S10 are removed, and step S8 is followed directly by step S11, thereby reducing the computational complexity. While the previous algorithm is more accurate, in this algorithm the tree grows slowly, at most one new node per symbol. Thus in the beginning, when the tree has not yet grown to its maximal depth, some counts are lost. If the sequence length is much longer than the maximal tree depth, then the difference in the counter values produced by the two algorithms will be practically insignificant for the nodes left after the pruning process.

[0375] In order to better understand the algorithms of FIG. 6, reference is now made to FIGS. 7-9, which are diagrams of a model being constructed using the algorithm of FIG. 6. Further illustrations are given in Tables A1 and A2.

[0376] In FIGS. 7-9, tree 100 initially comprises a root node 102 and three symbol nodes 104-108. Each one of nodes 102-108 has three counters, one for each of the possible symbols "a", "b" and "c". The counters at the root node give the numbers of appearances of the respective symbols, and the counters at the subsequent, or descendant, nodes represent the numbers of appearances of the respective symbol following the symbol path represented by the node itself. Thus node 104 represents the symbol path "a". The second counter therein represents the symbol "b", and this counter being set to 1 means that in the received string so far the number of "b"s following an "a" is 1. Node 106 represents the symbol path "b" and the first counter represents the symbol "a". Thus the first counter being set to "1" means that in the received string so far the symbol "a" has appeared once following a "b". The second counter being at "0" implies that there are no "b"s followed by "b"s.

[0377] Node 108 represents context "b a", corresponding to the symbol path or the sequence "a b" (recall that contexts are written in reverse order). The first counter, representing "a", being set to "1" shows that there is one instance of the sequence "a b" being followed by "a".

[0378] In FIG. 8, a fourth symbol x₄=b is received. Steps S3 to S10 of FIG. 6 are now carried out. The symbol b, as preceded by "a b a" in that order, can be traced back from node 104 to 102 (because the traceback covers the "b a" suffix of the sequence). The "b" counters are incremented at each node passed in the traceback. Likewise the sequence "a b a" can be traced back from node 108 to the root, again incrementing the "b" counters each time.

[0379] In FIG. 9, a new node 110 is added after node 104, representing step S7 of FIG. 6. The node is assigned three counters as with all previous nodes. The "b" counter thereof is set to 1 and all other counters are set to "0", as specified in step S8 of FIG. 6. It is noted that the context for the new node is "a b", thus representing the sequence "b a".

[0380] Returning now to the tree pruning stage 46 of FIG. 4, it is necessary to prune the tree, as will be described below, to obtain what may be referred to as the optimal contexts of T_(N). Tree pruning is achieved by retaining the deepest nodes w in the tree that practically satisfy equation 2.3 above. The following two pruning rules apply (see Weinberger, Rissanen and Feder (1995) for further details):

[0381] Pruning rule 1: the depth of node w, denoted by |w|, is bounded by a logarithmic ratio between the length of the string and the number of symbol types, i.e., |w|≦log(t+1)/log(d); and,

[0382] Pruning rule 2: the information obtained from the descendant nodes, sb ∀b∈X, compared to the information obtained from the parent node s, is larger than a 'penalty' cost for growing the tree (i.e., of adding a node).

[0383] The driving principle is to prune any descendant node having a distribution of counter values similar to that of the parent node. In particular, we calculate Δ_(N)(sb), a measure of the (ideal) code-length difference of the descendant node sb, ∀b∈X, $\begin{matrix}{\Delta_{N}(sb) = \sum\limits_{x \in X} n(x \mid sb)\,\log\left( \frac{\hat{P}(x \mid sb)}{\hat{P}(x \mid s)} \right)} & (4.1)\end{matrix}$

[0384] and then require that Δ_(N)(w)≧C(d+1)log(t+1), wherein

[0385] logarithms are taken to base 2; and

[0386] C is a pruning constant tuned to process requirements (with default C=2 as suggested in Weinberger, Rissanen and Feder (1995)).
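
As an illustrative sketch (reusing the Node structure from the growing sketch above), the pruning decision of equation 4.1 and its threshold might be coded as follows; the bottom-up order of the recursion is our choice, not spelled out in the text:

```python
import math

def code_length_difference(parent, child):
    """Delta_N(sb) of equation 4.1, using frequency estimates for
    P(x|sb) and P(x|s); logarithms are taken to base 2."""
    n_child = sum(child.counts.values())
    n_parent = sum(parent.counts.values())
    delta = 0.0
    for x, n_x in child.counts.items():
        if n_x > 0:  # terms with n(x|sb) = 0 contribute nothing
            delta += n_x * math.log((n_x / n_child) /
                                    (parent.counts[x] / n_parent), 2)
    return delta

def prune(node, t, d, C=2):
    """Trim descendants whose counter distribution is similar to the
    parent's, i.e. whose Delta_N falls below C*(d+1)*log2(t+1); the
    root itself is never pruned (Delta_N(x^0) is taken as infinity)."""
    threshold = C * (d + 1) * math.log(t + 1, 2)
    for sym in list(node.children):
        child = node.children[sym]
        prune(child, t, d, C)
        if not child.children and code_length_difference(node, child) < threshold:
            del node.children[sym]
```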

[0387] The tree pruning process is extended to the root node with the condition Δ_(N)(x⁰)=∞, which implies that the root node itself cannot be pruned.

[0388] Reference is now made to FIG. 10, which is a simplified diagram showing a pruned counter context-tree 112 constructed by applying the tree building of FIG. 6 followed by tree pruning on a string containing 136 symbols, namely eight replications of the sub-string (a,b,a,b,c,a,b,a,b,c,a,b,a,b,a,b,c). The tree comprises a root node 114 and five further nodes 116-124. By contrast, the unpruned tree from which this was taken may typically have had three descendant nodes for each node.

[0389] Returning again to FIG. 4, stage 48, the selection of optimal contexts, is now described in greater detail. In stage 48, a set of optimal contexts, Γ, containing the S shortest contexts satisfying equation 2.3, is specified. An optimal context can be either a path to a leaf (a leaf being a node with no descendants) or a partial leaf in the tree. A partial leaf is defined, for an incomplete tree, as a node which is not a leaf: for certain symbols its path defines an optimal context satisfying equation 2.3, while for other symbols equation 2.3 is not satisfied and descendant nodes are created. The set of optimal contexts is specified by applying the following rule: $\begin{matrix}{\Gamma = \left\{ s : \sum\limits_{x \in X}\left( n(x \mid s) - \sum\limits_{b \in X} n(x \mid sb)\right) > \varepsilon \right\} \quad \forall s \in T_{t},} & (4.2)\end{matrix}$

[0390] where ε=0 is the default value of a user-defined parameter. This means that Γ contains only those contexts that are not part of longer contexts. When the inequality in expression 4.2 turns into equality, that context is fully contained in a longer context and, thus, is not included in Γ. It is noted that in each level of the tree there is a context that does not belong to a longer context and, therefore, does not satisfy equation 4.2. This is generally due to initializing conditions. Such inconsistency can be resolved by introducing an initiating symbol string as suggested in Weinberger, Rissanen and Feder (1995).

[0391] In summary, Γ contains all the leaves in the tree as well as partial leaves satisfying equation 4.2 for certain symbols.

[0392] Returning to FIG. 10, node 122 corresponds to the string history or context s=ba and is, as defined above, a partial leaf (acting as an optimal context) for the symbols a and c. This is firstly because the longer context s=bac does not include all symbol occurrences of the symbols a and c, and secondly because the contexts bab and baa were pruned and lumped into the context ba. Applying equation 4.2 to the pruned counter context-tree presented in FIG. 10 results in four optimal contexts, Γ={a; bac; c; ba}. The first three contexts in Γ are leaves; the last is a partial leaf and defines an optimal context for the symbols a and c. Considering now in greater detail stage 50, estimation of parameters, in FIG. 4, the estimation stage is composed of three steps as follows:

[0393] 1) the probabilities of optimal contexts are estimated and denoted by {circumflex over (P)}(s), s∈Γ;

[0394] 2) the conditional probabilities of symbols given the optimal contexts are estimated and denoted by {circumflex over (P)}(x|s), x∈X, s∈Γ; and

[0395] 3) the estimated joint probabilities of symbols and optimal contexts are calculated, {circumflex over (P)}(x, s), x∈X, s∈Γ.

[0396] Given the set of optimal contexts and the pruned counter tree, the probabilities of optimal contexts in the tree, {circumflex over (P)}(s), s∈Γ, are estimated by their frequency in the string: $\begin{matrix}{\hat{P}(s) = \frac{n(s)}{\sum\limits_{s \in \Gamma} n(s)} = \frac{\sum\limits_{x \in X}\left( n(x \mid s) - \sum\limits_{b \in X} n(x \mid sb)\right)}{\sum\limits_{s \in \Gamma}\sum\limits_{x \in X}\left( n(x \mid s) - \sum\limits_{b \in X} n(x \mid sb)\right)}} & {\forall x \in X,\ s \in \Gamma}\end{matrix}$

[0397] where n(s) is the sum of the symbol counters in the corresponding leaves (or partial leaves) that belong to the optimal context s and not to a longer context sb, b∈X. Each symbol in the string thus belongs to one out of S disjoint optimal contexts, each of which contains n(s) symbols. An allocation of the symbols of a sufficiently long string to distinctive optimal contexts can be approximated by the multinomial distribution.
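
A sketch of this estimate, again over the illustrative Node structure used earlier (identification of the optimal context set Γ is assumed to have been done already; names are ours):

```python
def context_count(node):
    """n(s): symbol occurrences attributed to context s itself, i.e.
    the node's counters minus everything already counted in its
    descendant (longer) contexts."""
    own = sum(node.counts.values())
    deeper = sum(sum(c.counts.values()) for c in node.children.values())
    return own - deeper

def optimal_context_probabilities(gamma):
    """Estimate P(s) for every optimal context, gamma being a dict
    mapping each context string in Gamma to its tree node."""
    n = {s: context_count(node) for s, node in gamma.items()}
    total = sum(n.values())
    return {s: count / total for s, count in n.items()}
```

Applied to the tree of FIG. 10 this reproduces the values {56/136, 24/136, 24/136, 32/136} quoted below.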

[0398] Returning to FIG. 10, the estimated probabilities of the optimal contexts are given respectively by,

[0399] {{circumflex over (P)}(a), {circumflex over (P)}(bac), {circumflex over (P)}(c), {circumflex over (P)}(ba)}={56/136, 24/136, 24/136, 32/136}.

[0400] Once the symbols in the string are partitioned into S substrings of optimal contexts, the conditional probabilities of symbol types given an optimal context are estimated by their frequencies in the respective substring: $\begin{matrix}{\hat{P}(x \mid s) = \frac{n(x \mid s) - \sum\limits_{b \in X} n(x \mid sb)}{n(s)}} & {\forall x, b \in X,\ s \in \Gamma} & (4.3)\end{matrix}$

[0401] where $\frac{0}{0} \equiv 0.$

[0402]  The distribution of symbol types in a given optimal context is,thus, approximated by another multinomial distribution.

[0403] An alternative, predictive approach to equation 4.3 was suggested in Weinberger, Rissanen and Feder (1995). It is implemented in cases where one needs to assign strictly positive probability values to outcomes that have not actually appeared in the sample string, yet can occur in reality. Thus, $\begin{matrix}{\hat{P}(x \mid s) = \frac{n(x \mid s) + \frac{1}{2}}{\sum\limits_{x \in X} n(x \mid s) + \frac{d}{2}}} & {\forall x \in X,\ s \in \Gamma} & (4.4)\end{matrix}$

[0404] The choice between the two alternative procedures above depends both on the knowledge regarding the system states and on the length of the string used to construct the context-tree. The latter approach is suitable for applications involving forecasting.
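
The two estimators differ only in the smoothing term, as the following illustrative sketch over the same Node structure makes explicit:

```python
def conditional_probabilities(node, d, predictive=False):
    """P(x|s) for one optimal context s, represented by its node.

    predictive=False implements equation 4.3: pure frequencies over
    the n(s) symbols belonging to s itself, with 0/0 taken as 0.
    predictive=True implements equation 4.4: adding 1/2 to every
    counter gives unseen-but-feasible symbols strictly positive
    probability."""
    if predictive:
        total = sum(node.counts.values()) + d / 2
        return {x: (c + 0.5) / total for x, c in node.counts.items()}
    own = {x: node.counts[x]
              - sum(ch.counts[x] for ch in node.children.values())
           for x in node.counts}
    n_s = sum(own.values())
    return {x: (c / n_s if n_s else 0.0) for x, c in own.items()}
```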

[0405] Reference is now made to FIG. 11, which is a simplified tree diagram showing a tree 130 having a root node 132 and five other nodes 134-140. The counters now contain probabilities, in this case the estimated conditional probabilities of symbols given contexts. The probability estimates are generated by applying equation 4.3 to the counter context-tree 112 of FIG. 10. For example, the conditional probability of a symbol type x∈{a, b, c} given the context s=a is estimated as {circumflex over (P)}(x|a)=(0, 56/56, 0)=(0,1,0), whereas the conditional probabilities of a symbol x∈{a,b,c} given the context s=ba are estimated as {circumflex over (P)}(x|ba)=(8/32, 0, 24/32)=(0.25, 0, 0.75). The probabilities of symbols in non-optimal contexts are also shown for general information.

[0406] As shown in the figure, the optimal contexts are {132, 140, 136, 138}.

[0407] Returning now to FIG. 3a, tree 30 is the context tree generated from a string of length N=1088 (64 replications of the basic string a,b,a,b,c,a,b,a,b,c,a,b,a,b,a,b,c). The maximum tree depth has increased from three to five levels, as opposed to the preceding examples. Nevertheless, the number of optimal contexts acting as states has only increased from four to five. It is pointed out that had a Markov chain model of order k=5 been used, it would have been necessary to estimate transition probabilities between 3⁵=243 states, most of which are redundant in any case.

[0408] The joint probabilities of symbols and optimal contexts that represent the context-tree in its final form are evaluated thus:

{circumflex over (P)}(x,s)={circumflex over (P)}(x|s)·{circumflex over (P)}(s), x∈X, s∈Γ.

[0409] As explained above with respect to FIG. 5, the model as built in accordance with the procedures outlined in FIGS. 6-11 can be used in the comparison stage of FIG. 5 to obtain information about the comparative statistical properties of data sequences.

[0410] The above procedure is summarized and illustrated in the following two tables for the string x⁶=4,4,4,3,3,2, where d=5 and X={0,1,2,3,4}. (The Tree column of the original tables, which showed the tree after each step, is omitted here.)

TABLE A1: Tree growing and counter updating stage in the context algorithm

Step 0 (T₀): Initialization: the root node, λ, denotes the empty context.

Step 1 (T₁, x¹ = 4): The only context for the first symbol is λ; the counter n(x = 4|λ) is incremented by one.

Step 2 (T₂, x² = 4, 4): The counters n(x = 4|λ) and n(x = 4|s = 4) are incremented by one. The node of the context s = 4 is added to accumulate the counts of symbols given the context s = 4.

Step 3 (T₃, x³ = 4, 4, 4): The counter of the symbol 4 is incremented by one in the nodes from the root to the deepest node along the path defined by the past observations. In this case, the counters n(x = 4|λ) and n(x = 4|s = 4) are incremented by 1. A new node is added for the context s = 44, with n(x = 4|s = 44) = 1.

Step 4 (T₄, x⁴ = 4, 4, 4, 3): The counters n(x = 3|λ), n(x = 3|s = 4) and n(x = 3|s = 44) are incremented by one, since the past contexts are s = λ, s = 4 and s = 44. A new node is added for the context s = 444 of the observation x₄ = 3.

Step 5 (x⁵ = . . ., 4, 3, 3): New nodes are added for the contexts s = 3, s = 34, s = 344, . . .. The counter of the symbol x = 3 is updated from the root to the deepest node on the path of past observations.

Step 6 (x⁶ = . . ., 3, 3, 2): The counter of the symbol x = 2 is updated from the root to the deepest node on the path of past observations. The contexts s = 33 and so on are added.

[0411] TABLE A2: Pruning stage in the context algorithm for the string

Rule 1: The maximum tree depth is bounded by log(t)/log(d) = log(6)/log(5) = 1.11, i.e., the maximum tree depth is level one. Thus, all nodes of level 2 and below are trimmed.

Rule 2: For the remaining nodes in level one and for the root, trimming rule 2 is applied. The threshold for C = 2 is C(d + 1)log(t + 1) = 2 × 6 × log₂(7) ≈ 33.7, and Δ₆(w) is computed for each of the nodes. The code-length difference is below the threshold in every case; hence the first-level nodes are trimmed.

[0412] Tuning of the Model for Specific Conditions

[0413] The use of context trees as modelers for analysis and pattern recognition, in particular for biological data sets, may be modified in various ways to improve performance. It is noted that the various parameters of the context tree model can be used to optimize performance for different pattern recognition problems, by using side information specific to the problem and by using empirical and numerical tests. The flexibility of the modelers is too large to be indicated fully here. Nevertheless, in the following list several optional modifications of the model parameters are indicated for specific experimental conditions:

[0414] Combination of several separate context tree models: several fixed string length models are tuned together. That is to say, each model is tuned on a different part of a long input string, through manipulation of the context tree algorithm with limited size buffers and with jumping over irrelevant string segments. Then, instead of using a given string length, the string is divided into segments and each segment is handled by a different context tree model. For example, in the E. coli promoter-recognition problem, it is known that most promoters have two different recognition sites, each having six base-pairs (bp) and separated by an intermediate segment of variable length. Therefore, one can modify the input string length for training and identification.

[0415] For a model based on a string length of 12 bp (and without implementing other optimization procedures such as non-homogeneous trees), a context tree was experimentally identified that yielded a True Positives value of 69.75% and a True Negatives value of 67.94% using a cross-validation set. Then, using the same non-optimized model but based on two separate context trees, each for input strings of 6 bp, the results improved to a True Positives value of 71.43% and a True Negatives value of 72.01% for the cross-validation set. Optimizing the other parameters of the tree model enabled a dramatic improvement, to a True Positives value of 92.86% and a True Negatives value of 91.5%. The same idea may be extended to using a model based on four different context trees for input strings of 3 bp each. Such partitions increase the position-dependent sensitivity of the model; however, they lead to some loss of information regarding dependencies across the border regions between the partial sequences. An optimization procedure may thus be performed for finding the optimal split in terms of recognition efficiency.
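
A sketch of the two-site scoring described above; seq_loglik stands in for whatever per-model likelihood routine is used and is an assumption of this illustration, not part of the embodiment:

```python
def promoter_loglik(candidate, tree_site1, tree_site2, seq_loglik):
    """Score a 12 bp candidate with two separate 6 bp context-tree
    models, one per recognition site, and sum the log-likelihoods.
    seq_loglik(tree, s) is an assumed helper returning the
    log-likelihood of string s under a context-tree model."""
    return (seq_loglik(tree_site1, candidate[:6])
            + seq_loglik(tree_site2, candidate[6:]))
```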

[0416] Sliding Windows: the division of an input string into sub-strings can be performed either to obtain mutually exclusive sub-strings that do not share common sequences, or in the manner of a sliding window, that is to say, forming substrings at incremental offsets along the sequence so that consecutive substrings overlap. For example, in the problem of identification of DNA coding regions, at each step an input string of a fixed length contains several new bp in comparison to the input string of the last step. In this manner, the model is gradually modified and the classification can be indicated at earlier stages.
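
Both regimes reduce to a choice of step size in a simple generator, as in this illustrative sketch:

```python
def runs(sequence, length, step):
    """Generate the runs used for monitoring or classification:
    step == length gives mutually exclusive sub-strings, while
    step < length gives sliding windows, each adding only `step`
    new symbols relative to the previous window."""
    for start in range(0, len(sequence) - length + 1, step):
        yield sequence[start:start + length]
```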

[0417] Non-homogeneous models: in non-homogeneous models, different alternating context tree models are interlaced in a predetermined sequence that can capture cyclic phenomena in the string. For example, in the coding/non-coding classification problem, it is known that coding DNA segments have an internal structure of triplets (or codons) and that the position in the triplet is important for correct classification. It is possible to grow three different context trees, one for each of the relative positions of the bases in the codon, and when handling large strings of coding, or suspected coding, DNA, the predictions of the three trees are interlaced together to generate the prediction for the DNA segment.
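
A sketch of the interlacing; symbol_loglik is an assumed helper, not part of the embodiment:

```python
def interlaced_loglik(segment, trees, symbol_loglik):
    """Non-homogeneous scoring of a suspected coding segment: the
    symbol at position t is evaluated by the context tree for codon
    position t mod 3, and the per-symbol log-likelihoods are summed.
    symbol_loglik(tree, segment, t) is an assumed helper giving the
    log-probability of segment[t] given its context under tree."""
    return sum(symbol_loglik(trees[t % 3], segment, t)
               for t in range(len(segment)))
```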

[0418] To illustrate the above, an experiment was carried out using the benchmark dataset of Fickett and Tung (1992) for discrimination between coding and non-coding DNA segments of length 54/108/162 bp respectively. Accuracies of 85.0/90.5/92.6 percent respectively were obtained using the non-homogeneous method, compared to accuracies of 74.3/77.5/80.5 using only one context tree. The non-homogeneous results are better than the previously best results on the dataset using other non-homogeneous methods, 80.7/84.9/88.0; that is to say, the present method provided an approximate 4% improvement.

[0419] Predictive vs. non-predictive models: predictive models assign non-zero probabilities to events that were never observed in the training set, while non-predictive models assign zero probability to such events. The choice of the appropriate model is based on the feasibility of the unobserved events: if unobserved events are not feasible, then the non-predictive formula is more accurate. Once again, the use of predictive vs. non-predictive models can be checked against cross-validation properties.

[0420] Selection of the tree truncation threshold: the threshold is one of the most important default parameters and directly determines the size of the final model. Indirectly it determines the computational aspects of the algorithm and the accuracy of the model. It may be optimized for each application. For example, in predictive applications such as time series forecasting, it was empirically found that a smaller-than-default threshold improves the quality of the prediction.

[0421] Tree construction parameters: the tree construction parameters proposed in the algorithm are default parameters for optimization. Such parameters include: i) the tree maximum depth; ii) the algorithm buffer size; iii) values of pruning constants; iv) the length of input sequences; v) the order of input symbols; vi) the number of nodes to grow after a leaf is reached; vii) other parameters indicated in FIG. 6A; etc. Further improvement of the model is possible by optimization of the default parameters in accordance with the conditions of each specific application.

[0422] In general, a Markov model takes into account neither the possibility of position dependence nor the possibility of varying order (partial leaves). The stochastic models discussed herein do take such position information into account.

Applications

[0423] Use of the above context model to monitor changes in state has a wide range of applications. In general the model part may be applied to any process which can be described by different arrangements of a finite group of symbols and wherein the relationship between the symbols can be given a statistical expression, although, in the present disclosure, emphasis is placed on spatially related data. For example, sequences of nucleotides belonging to an alphabet symbol set of four letters, sequences of amino acids belonging to an alphabet set of 20 letters, or sequences of protein structures described by an alphabet set of letters for primary, secondary and 3D structure all make good candidates for consideration. The comparison stage as described above allows changes in the statistics of the symbol relationships to be monitored, and thus modeling plus comparison may be applicable to any such process in which dynamic changes in those statistics are meaningful.

[0424] In the field of biology an important application concerns the recognition and prediction of common function amongst biological sequences, for example within promoters and coding sequences from amongst DNA sequences, or in the recognition or prediction of three-dimensional shape and function in proteins. DNA sequences are sequences of four bases termed Adenine, A, Cytosine, C, Guanine, G, and Thymine, T, and combinations of the four bases provide regulatory sequences, encoding sequences for amino-acids to be used in proteins, intron sequences which are transcribed yet not translated, and non-coding sequences of as yet unidentified function, including high, moderate and low repetitive sequences. All of these have a certain similarity in the statistical distribution of the bases; however, such similarity is not at all trivial to detect.

[0425] In general, any kind of biological sequence can be analyzed in the above way to identify or predict properties. Provided groups of sequences can be identified that share a common functionality or structure, they may be used as a training database to construct a tree model and later to identify whether a new sequence belongs to that group, or has a similar feature or structure that might imply a certain relation to that group.

[0426] Thus, for example, promoters that form a certain group of DNA sequences sharing a common functionality can be used as a training database to construct a tree and later to identify whether a new sequence belongs to that group.

NUMERICAL EXAMPLE

[0427] By way of example, an experimental attempt to identify promoter regions in E. coli DNA was described with reference to FIG. 3b. An input string of observations is introduced to a feature transformation module along with a context-tree model. The tree model is constructed by the methods described above based on a training set generated by concatenating several 12 base-pair sequences of known promoters. The context-tree model represents the probabilistic structure in the training set but with a huge reduction of dimensionality. Instead of representing all conditional statistics that emerged from the training set, the tree represents only those statistics that were found to be significant. The transformation computes the log-likelihood of the input string being a "promoter" or a "non-promoter" according to the tree structure (i.e., the feature vector, y). The log-likelihood values are then introduced to the classifier, which decides, based on a given threshold, whether to classify the input string as a promoter or not.

[0428] More specifically, in the experiment, 238 DNA sequences taken from E. coli promoters (two six-mers) were coded by a numerical alphabet (A=1, C=2, G=3, T=4) and arranged into a long source string S. A context tree modeler was set to build a tree from the data using a context length, i.e. deepest node, of 5; that is to say, at each stage of building of the tree only the five previous symbols were retained, and a buffer having symbol length 12 was set. Preferably, the buffer length serves to reset the process of growing the context tree.

[0429] In the experiment a training set of 238 E. coli promoter sequences was available. In a first experiment, performed without cross-validation, all the training set was used to build a tree model and then the model was used to specify the likelihood for each promoter, i.e., the promoter itself was also part of the training set. In the second experiment 238 different training sets were used to calculate the likelihood of each promoter, each training set containing 237 promoters, i.e., without the promoter itself.

[0430] In the cross-validation experiment, reference context trees were built in each case for sequences of 237 promoters and then an attempt was made to identify the 238^(th) promoter from the source string S.

[0431] The above experiment is referred to as a cross-validation experiment, and its purpose is to test the quality of a model built with 237 promoters on promoter No. 238, which did not take part in the tree growing process.

[0432] The result achieved was that the third context tree in each case, namely that built from a concatenation of three replicas of the string S, gave correct identification of the 238th string in 75% of cases, without optimizing the tree parameters at all, thus providing an illustrative example. In a later experiment, by optimizing certain tree parameters, the accuracy level was raised to 94%.

[0433] The first experiment was repeated with position dependence and with sub-trees, and improvements were obtained. Thus, in the sub-trees experiment each promoter was represented by two separate 6 base-pair trees instead of one 12 base-pair tree. The reason is that the E. coli promoter is composed of two 6 bp sites that are separated by several nucleotides. When calculating the likelihood of a promoter, the first 6 bp were evaluated using the first tree model and the second 6 bp using the second model.

[0434] Stochastic models were built in the way described above in order to distinguish between coding and non-coding DNA sequences. The models demonstrated substantial species independence, although more specific, species-dependent models may provide greater accuracy. The experiments concerned the construction of a coding model and a non-coding model, each using 200 DNA strings divided into test sets and validation sets respectively.

[0435] Non-homogeneous trees applied to DNA segments of length 162 bp with a zero threshold yielded 94.8 percent correct rejections (true negatives) and 93 percent correct acceptances (true positives). Using another model with a different threshold, 99.5% correct rejections were obtained for the coding model, with 21% false rejections. Using the same model for the non-coding model, the percentage of correct rejections was 100% and the percentage of false rejections was 12%.

[0436] It was noted that the coding model had a much smaller context tree than the non-coding model.

[0437] Other DNA applications include the ability to distinguish, analyze and classify: i) exons and introns; ii) splice sites; iii) terminators; iv) Poly A; v) proteins; vi) other biological sequences that share functionality features. There is also provided the possibility of predicting the structure of a protein.

[0438] Medical applications of the above embodiments are numerous. Any signal representing a body function can be discretized to provide a finite set of symbols. The symbols appear in sequences which can be modeled, and changes in the sequence can be indicated by the comparison step referred to above. Thus, medical personnel are provided with a system that can monitor a selected bodily function and which is able to provide an alert only when a change occurs. The method is believed to be more sensitive than existing monitoring methods.

[0439] A further application of the modeling and comparison process is image processing and comparison. A stochastic model of the kind described above can be built to describe an image, for example a medical image. The stochastic model may then be used to compare other images in the same way that it compares other data sequences. Such a system is useful in automatic screening of medical image data to identify features of interest. The system can be used to compare images of the same patient taken at different times, for example to monitor the progress of a tumor. Alternatively, it could be used to compare images taken from various patients already diagnosed with a given condition against an image showing the same view of a patient being tested.

[0440] A further application of the modeling and comparison process is in pharmaceutical research. A protein (e.g., an enzyme, receptor, ligand, etc.) carrying out a particular function is required. A series of proteins which all carry out the required function are sequenced or have a known sequence, and a model derived from all the sequences together may define the required sequence for the desired protein. At a higher level of organization, the same applies to structure: a series of proteins which all carry out the required function are structurally analyzed or have a known structural analysis (obtained, for example, by X-ray crystallography), and a model derived from all the structures together may define the required structure for the desired protein.

[0441] A further application of the modeling and comparison process is to analyze nucleic acid microarrays, such as DNA chips, to determine, for example, the nature (e.g., the base sequence) of the materials to which they have been exposed (e.g., nucleic acids of unknown or undetermined sequence).

[0442] A further application of the modeling and comparison process is to analyze protein (e.g., antigen or antibody) microarrays, such as protein chips, to determine, for example, the nature of the materials to which they have been exposed (e.g., antibodies or antigens, respectively).

[0443] A further application of the modeling and comparison process as described above is in forecasting. For example, such forecasting can be applied to natural data, such as weather conditions, or to financial data such as stock markets. As the model expresses a statistical distribution of the sequence, it is able to give a probability forecast for the next expected symbol given a particular received sequence. A further application of the present embodiments is to sequences of multi-input single-output data, such as records in a database, which may for example represent different features of a given symbol. Considering a database with records that arrive at consecutive times, the algorithm (when extended to multiple dimensions) may compare a sequence of records and decide whether they have similar statistical properties to the previous records in the database. The comparison can be used to detect changes in the characteristics of the source which generates the records in the database.
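
As a minimal sketch of the probability forecast mentioned above, over the illustrative Node structure used earlier, one walks down the tree along the most recent symbols and normalizes the counters of the deepest matching node:

```python
def forecast_next(root, recent):
    """Probability forecast for the next symbol given the most recent
    symbols (newest last): descend along the deepest matching context
    and normalize the counters of the node reached."""
    node = root
    for past in reversed(recent):
        if past not in node.children:
            break                      # deepest matching context found
        node = node.children[past]
    total = sum(node.counts.values())
    return {x: c / total for x, c in node.counts.items()} if total else {}
```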

[0444] Likewise the database may already be in place, in which case the algorithm may compare records at different locations in the database.

[0445] It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.

[0446] It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention is defined by the appended claims and includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description.

1. Apparatus for building a stochastic model of a data sequence, said data sequence comprising spatially related symbols selected from a finite symbol set, the apparatus comprising: an input for receiving said data sequence, a tree builder for expressing said symbols as a series of counters within nodes, each node having a counter for each symbol, each node having a position within said tree, said position expressing a symbol sequence and each counter indicating a number of its corresponding symbol which follows a symbol sequence of its respective node, and a tree reducer for reducing said tree to an irreducible set of conditional probabilities of relationships between symbols in said input data sequence.

2. Apparatus according to claim 1, said tree reducer comprising a tree pruner for removing from said tree any node whose counter values are within a threshold distance of counter values of a preceding node in said tree.

3. Apparatus according to claim 2, wherein said threshold distance and tree construction parameters are user selectable.

4. The apparatus of claim 3, wherein said user selectable parameters further comprise a tree maximum depth.

5. The apparatus of claim 3, wherein said user selectable parameters further comprise an algorithm buffer size.

6. The apparatus of claim 3, wherein said user selectable parameters further comprise a value of at least one pruning constant.

7. The apparatus of claim 3, wherein said user selectable parameters further comprise a length of input sequences.

8. The apparatus of claim 3, wherein said user selectable parameters further comprise an order of input symbols.

9. Apparatus according to claim 2, wherein said tree reducer further comprises a path remover operable to remove any path within said tree that is a subset of another path within said tree.
10. Apparatus according to claim 1, wherein said sequential data is a string comprising consecutive symbols selected from a finite set.

11. The apparatus of claim 10, further comprising an input string permutation unit for carrying out permutations and reorganizations of the input string using external information about a process generating said string.

12. Apparatus according to claim 9, wherein said string is a nucleic acid sequence.

13. The apparatus of claim 12, wherein said string is a promoter and said tree is operable to identify other promoters.

14. The apparatus of claim 12, wherein said string is a string of coding DNA, and said tree is operable to identify other coding strings.

15. The apparatus of claim 12, wherein said string is a string of non-coding DNA, and said tree is operable to identify other non-coding strings.

16. The apparatus of claim 12, wherein said string is a DNA string and said tree is operable to identify poly-A terminators.

17. The apparatus of claim 12, wherein said string has a given property, and said tree is operable to identify other strings having said given property.

18. Apparatus according to claim 9, wherein said string is an amino-acid sequence and the symbols comprise at least some of the 20 amino-acids.

19. The apparatus of claim 9, wherein said string is an amino acid string and said tree is operable to identify at least one of primary, secondary and three dimensional protein structure.

20. The apparatus of claim 18, wherein said string has a given property, and said tree is operable to identify other strings having said given property.
21. Apparatus according to claim 12, wherein said nucleic acid sequence is a promoter sequence and another nucleic acid sequence is a non-promoter sequence, wherein said stochastic modeler is operable to build models of said promoter sequence and said non-promoter sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a promoter sequence or a non-promoter sequence.

22. Apparatus according to claim 12, wherein said nucleic acid sequence is a coding sequence and another nucleic acid sequence is a non-coding sequence, wherein said stochastic modeler is operable to build models of said coding sequence and said non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a coding sequence or a non-coding sequence.

23. Apparatus according to claim 12, wherein said nucleic acid sequence is a repetitive sequence and another nucleic acid sequence is a non-repetitive sequence, wherein said stochastic modeler is operable to build models of said repetitive sequence and said non-repetitive sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a repetitive sequence or a non-repetitive sequence.

24. Apparatus according to claim 12, wherein said nucleic acid sequence is a non-coding sequence and another nucleic acid sequence is a coding sequence, wherein said stochastic modeler is operable to build models of said non-coding sequence and said coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a non-coding sequence or a coding sequence.

25. Apparatus according to claim 12, wherein said nucleic acid sequence is an exon sequence, wherein said stochastic modeler is operable to build a model of said exon sequence and said comparator is operable to compare a second nucleic acid sequence with said model to determine whether said second sequence is an exon sequence.

26. Apparatus according to claim 12, wherein said nucleic acid sequence is an intron sequence, wherein said stochastic modeler is operable to build a model of said intron sequence and said comparator is operable to compare a second nucleic acid sequence with said model to determine whether said second sequence is an intron sequence.

27. Apparatus according to claim 1, wherein said stochastic model is refinable using further data sequences, thereby to define a structure of a common attribute of said data sequences.
28. Apparatus for determining statistical consistency in spatially related data comprising a finite set of symbols, the apparatus comprising a sequence input for receiving said spatially related data, a stochastic modeler for producing at least one stochastic model from at least part of said spatially related data, and a comparator for comparing said stochastic model with a prestored model, thereby to determine whether there has been a statistical change in said data.

29. Apparatus according to claim 28, wherein said stochastic modeler comprises: a tree builder for expressing said symbols as a series of counters within nodes, each node having a counter for each symbol, each node having a position within said tree, said position expressing a symbol sequence and each counter indicating a number of its corresponding symbol which follows a symbol sequence of its respective node, and a tree reducer for reducing said tree to an irreducible set of conditional probabilities of relationships between symbols in said input data sequence.

30. Apparatus according to claim 28, said prestored model being a model constructed using another part of said spatially related data.

31. Apparatus according to claim 28, said comparator comprising a statistical processor for determining a statistical distance between said stochastic model and said prestored model.

32. Apparatus according to claim 28, wherein said comparator comprises a statistical processor for determining a difference in statistical likelihood between said stochastic model and said prestored model.

33. Apparatus according to claim 31, said statistical distance being a relative complexity measure.

34. Apparatus according to claim 31, wherein said statistical distance comprises an SPRT statistic.

35. Apparatus according to claim 31, wherein said statistical distance comprises an MDL statistic.

36. Apparatus according to claim 31, wherein said statistical distance comprises a Multinomial goodness of fit statistic.

37. Apparatus according to claim 31, wherein said statistical distance comprises a Weinberger Statistic.

38. Apparatus according to claim 31, wherein said statistical distance comprises a KL statistic.

39. Apparatus according to claim 29, said tree reducer comprising a tree pruner for removing from said tree any node whose counter values are within a threshold distance of counter values of a preceding node in said tree.

40. Apparatus according to claim 39, wherein said threshold distance is user selectable.
41. The apparatus of claim 40, wherein user selectable parameters further comprise a tree maximum depth.

42. The apparatus of claim 40, wherein user selectable parameters further comprise an algorithm buffer size.

43. The apparatus of claim 40, wherein user selectable parameters further comprise a value of at least one pruning constant.

44. The apparatus of claim 40, wherein user selectable parameters further comprise a length of input sequences.

45. The apparatus of claim 40, wherein user selectable parameters further comprise an order of input symbols.

46. Apparatus according to claim 39, wherein said tree reducer further comprises a path remover operable to remove any path within said tree that is a subset of another path within said tree.

47. Apparatus according to claim 28, wherein said data comprises a nucleic acid sequence.

48. Apparatus according to claim 28, wherein said data comprises an amino-acid sequence.

49. Apparatus according to claim 28, wherein said sequential data is an output of a medical sensor sensing bodily functions.

50. Apparatus according to claim 47, wherein said nucleic acid sequence is a promoter sequence and another nucleic acid sequence is a non-promoter sequence, wherein said stochastic modeler is operable to build models of said promoter sequence and said non-promoter sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a promoter sequence or a non-promoter sequence.

51. Apparatus according to claim 47, wherein said nucleic acid sequence is a coding sequence and another nucleic acid sequence is a non-coding sequence, wherein said stochastic modeler is operable to build models of said coding sequence and said non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a coding sequence or a non-coding sequence.

52. Apparatus according to claim 47, wherein said nucleic acid sequence is a repetitive sequence and another nucleic acid sequence is a non-repetitive sequence, wherein said stochastic modeler is operable to build models of said repetitive sequence and said non-repetitive sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a repetitive sequence or a non-repetitive sequence.

53. Apparatus according to claim 47, wherein said nucleic acid sequence is a non-coding sequence and another nucleic acid sequence is a non-non-coding sequence, wherein said stochastic modeler is operable to build models of said non-coding sequence and said non-non-coding sequence and said comparator is operable to compare a third nucleic acid sequence with each of said models to determine whether said third sequence is a non-coding sequence or a non-non-coding sequence.

54. Apparatus according to claim 31, wherein said data sequence comprises image data of a first image.
55. Apparatus according to claim 54, said distance being indicative of a statistical distribution within said image.

56. Apparatus according to claim 55, further comprising an image comparator for comparing said statistical distribution with a statistical distribution of another image.

57. Apparatus according to claim 56, said other image being of a same view as said first image taken at a different time, said distance being indicative of time dependent change.

58. Apparatus according to claim 54, said image data comprising medical imaging data, said statistical distance being indicative of deviations of said data from an expected norm.

59. Apparatus according to claim 31, applicable to a database to perform data mining on said database.

60. Apparatus according to claim 28, said stochastic model being constructed from descriptions of a plurality of enzymes for carrying out a given task, said model thereby providing a generic structural description of an enzyme for carrying out said task.

61. Apparatus according to claim 28, said model being usable to analyze results of a nucleic acid microarray.

62. Apparatus according to claim 28, said model being usable to analyze results of a protein microarray.

63. A method of designing a protein for carrying out a predetermined task, the method comprising: taking a plurality of proteins known to carry out said predetermined task, constructing a stochastic model using an amino acid sequence of said plurality of proteins, and using said stochastic model to predict a protein sequence.

64. A method of designing a protein for carrying out a predetermined task, the method comprising: taking a plurality of proteins known to carry out said predetermined task, constructing a stochastic model using the 3D structure of said plurality of proteins, and using said stochastic model to determine a protein structure.

65. A method of distinguishing between biological sequences of a first kind and biological sequences of a second kind, each kind being expressible in terms of a same finite set of symbols, the method comprising: obtaining a statistically significant set of sequences of said first kind and building a stochastic model thereof, obtaining a statistically significant set of sequences of said second kind and building a stochastic model thereof, and taking a further sequence and comparing it with each stochastic model to determine whether it belongs to either set.

66. The method of claim 65, wherein said biological sequences are nucleic acid sequences.

67. The method of claim 65, wherein said biological sequences are amino acid sequences.

68. The method of claim 66, wherein the sequences of said first kind are promoter sequences.

69. The method of claim 66, wherein the sequences of said first kind are coding sequences and the sequences of said second kind are non-coding sequences.

70. The method of claim 69, wherein the sequences are non-species specific, thereby constructing models which are non-species specific.